Data Set
Datasets are essential for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research focuses on building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using a range of model architectures (e.g., transformers, convolutional neural networks) for analysis and for establishing benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn supports both scientific research and practical applications.
Papers
Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset
Panagiota Anastasopoulou, Jessica Torrey, Xavier Serra, Frederic Font
RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations
Kaichen Zhou, Yang Cao, Taewhan Kim, Hao Zhao, Hao Dong, Kai Ming Ting, Ye Zhu
CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset
Xiao Wang, Fuling Wang, Yuehang Li, Qingchuan Ma, Shiao Wang, Bo Jiang, Chuanfu Li, Jin Tang
GalaxiesML: a dataset of galaxy images, photometry, redshifts, and structural parameters for machine learning
Tuan Do (1), Bernie Boscoe (2), Evan Jones (1), Yun Qi Li (1, 3), Kevin Alfaro (1) ((1) UCLA, (2) Southern Oregon University, (3) University of Washington)
TaskComplexity: A Dataset for Task Complexity Classification with In-Context Learning, FLAN-T5 and GPT-4o Benchmarks
Areeg Fahad Rasheed, M. Zarkoosh, Safa F. Abbas, Sana Sabah Al-Azzawi
CycleCrash: A Dataset of Bicycle Collision Videos for Collision Prediction and Analysis
Nishq Poorav Desai, Ali Etemad, Michael Greenspan
A Systematic Review of NLP for Dementia - Tasks, Datasets and Opportunities
Lotem Peled-Cohen, Roi Reichart
Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models
Xin Li, Weize Chen, Qizhi Chu, Haopeng Li, Zhaojun Sun, Ran Li, Chen Qian, Yiwei Wei, Zhiyuan Liu, Chuan Shi, Maosong Sun, Cheng Yang
LML-DAP: Language Model Learning a Dataset for Data-Augmented Prediction
Praneeth Vadlapati
Excavating in the Wild: The GOOSE-Ex Dataset for Semantic Segmentation
Raphael Hagmanns, Peter Mortimer, Miguel Granero, Thorsten Luettel, Janko Petereit
Relighting from a Single Image: Datasets and Deep Intrinsic-based Architecture
Yixiong Yang, Hassan Ahmed Sial, Ramon Baldrich, Maria Vanrell
Off to new Shores: A Dataset & Benchmark for (near-)coastal Flood Inundation Forecasting
Brandon Victor, Mathilde Letard, Peter Naylor, Karim Douch, Nicolas Longépé, Zhen He, Patrick Ebel
MMDVS-LF: A Multi-Modal Dynamic-Vision-Sensor Line Following Dataset
Felix Resch, Mónika Farsang, Radu Grosu
Geospatial Road Cycling Race Results Data Set
Bram Janssens, Luca Pappalardo, Jelle De Bock, Matthias Bogaert, Steven Verstockt
Dataset Distillation-based Hybrid Federated Learning on Non-IID Data
Xiufang Shi, Wei Zhang, Mincheng Wu, Guangyi Liu, Zhenyu Wen, Shibo He, Tejal Shah, Rajiv Ranjan