Data Set
Datasets are central to training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research focuses on building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need to represent real-world scenarios realistically. This work includes developing novel annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using model architectures such as transformers and convolutional neural networks to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn supports both scientific understanding and practical applications across many fields.
Papers
GDTM: An Indoor Geospatial Tracking Dataset with Distributed Multimodal Sensors
Ho Lyun Jeong, Ziqi Wang, Colin Samplawski, Jason Wu, Shiwei Fang, Lance M. Kaplan, Deepak Ganesan, Benjamin Marlin, Mani Srivastava
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models
Ashutosh Sathe, Prachi Jain, Sunayana Sitaram
DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing
Haneul Yoo, Jieun Han, So-Yeon Ahn, Alice Oh
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, Igor Gitman
A chaotic maps-based privacy-preserving distributed deep learning for incomplete and Non-IID datasets
Irina Arévalo, Jose L. Salmeron
A Dataset of Open-Domain Question Answering with Multiple-Span Answers
Zhiyi Luo, Yingying Zhang, Shuyun Luo, Ying Zhao, Wentao Lyu
BUSTER: a "BUSiness Transaction Entity Recognition" dataset
Andrea Zugarini, Andrew Zamai, Marco Ernandes, Leonardo Rigutini