Data Set
Datasets underpin the training and evaluation of machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistency, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using model architectures such as transformers and convolutional neural networks to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn advances scientific understanding and improves practical applications across many fields.
Papers
Evaluating Model Performance in Medical Datasets Over Time
Helen Zhou, Yuwen Chen, Zachary C. Lipton
AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web
Michael Schlichtkrull, Zhijiang Guo, Andreas Vlachos
Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection
Debarpan Bhattacharya, Neeraj Kumar Sharma, Debottam Dutta, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K K, Sadhana Gonuguntla, Murali Alagesan
llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology
Masanori Hirano, Masahiro Suzuki, Hiroki Sakaji
DATED: Guidelines for Creating Synthetic Datasets for Engineering Design Applications
Cyril Picard, Jürg Schiffmann, Faez Ahmed
Benchmarking UWB-Based Infrastructure-Free Positioning and Multi-Robot Relative Localization: Dataset and Characterization
Paola Torrico Morón, Sahar Salimpour, Lei Fu, Xianjia Yu, Jorge Peña Queralta, Tomi Westerlund
Document Understanding Dataset and Evaluation (DUDE)
Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, Tomasz Stanisławek
CLImage: Human-Annotated Datasets for Complementary-Label Learning
Hsiu-Hsuan Wang, Tan-Ha Mai, Nai-Xuan Ye, Wei-I Lin, Hsuan-Tien Lin
MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine
Jie Xu, Lu Lu, Sen Yang, Bilin Liang, Xinwei Peng, Jiali Pang, Jinru Ding, Xiaoming Shi, Lingrui Yang, Huan Song, Kang Li, Xin Sun, Shaoting Zhang
Open-WikiTable: Dataset for Open Domain Question Answering with Complex Reasoning over Table
Sunjun Kweon, Yeonsu Kwon, Seonhee Cho, Yohan Jo, Edward Choi