Data Set
Datasets are essential for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need for realistic representations of real-world scenarios. This work involves developing novel annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using a range of model architectures (e.g., transformers, convolutional neural networks) to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn supports both scientific understanding and practical applications across many fields.
Papers
Machine Learning for Shipwreck Segmentation from Side Scan Sonar Imagery: Dataset and Benchmark
Advaith V. Sethuraman, Anja Sheppard, Onur Bagoren, Christopher Pinnow, Jamey Anderson, Timothy C. Havens, Katherine A. Skinner
CloudTracks: A Dataset for Localizing Ship Tracks in Satellite Images of Clouds
Muhammad Ahmed Chaudhry, Lyna Kim, Jeremy Irvin, Yuzu Ido, Sonia Chu, Jared Thomas Isobe, Andrew Y. Ng, Duncan Watson-Parris
Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception
Spencer Carmichael, Austin Buchan, Mani Ramanagopal, Radhika Ravi, Ram Vasudevan, Katherine A. Skinner
Lessons on Datasets and Paradigms in Machine Learning for Symbolic Computation: A Case Study on CAD
Tereso del Río, Matthew England
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
DDI-CoCo: A Dataset For Understanding The Effect Of Color Contrast In Machine-Assisted Skin Disease Detection
Ming-Chang Chiu, Yingfei Wang, Yen-Ju Kuo, Pin-Yu Chen
Analyzing and Mitigating Bias for Vulnerable Classes: Towards Balanced Representation in Dataset
Dewant Katare, David Solans Noguero, Souneil Park, Nicolas Kourtellis, Marijn Janssen, Aaron Yi Ding
On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning
Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
Hao Wang, Shuhei Kurita, Shuichiro Shimizu, Daisuke Kawahara