Data Set
Datasets are the foundation for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research focuses on building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and applying various model architectures (e.g., transformers, convolutional neural networks) to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn shapes both scientific progress and practical applications.
Papers
MathWriting: A Dataset For Handwritten Mathematical Expression Recognition
Philippe Gervais, Asya Fadeeva, Andrii Maksai
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
CNN-based explanation ensembling for dataset, representation and explanations evaluation
Weronika Hryniewska-Guzik, Luca Longo, Przemysław Biecek
AI Competitions and Benchmarks: Dataset Development
Romain Egele, Julio C. S. Jacques Junior, Jan N. van Rijn, Isabelle Guyon, Xavier Baró, Albert Clapés, Prasanna Balaprakash, Sergio Escalera, Thomas Moeslund, Jun Wan
RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization
Avinash Anand, Raj Jaiswal, Mohit Gupta, Siddhesh S Bangar, Pijush Bhuyan, Naman Lal, Rajeev Singh, Ritika Jha, Rajiv Ratn Shah, Shin'ichi Satoh
VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain
Khai Le-Duc
Alljoined1 -- A dataset for EEG-to-Image decoding
Jonathan Xu, Bruno Aristimunha, Max Emanuel Feucht, Emma Qian, Charles Liu, Tazik Shahjahan, Martyna Spyra, Steven Zifan Zhang, Nicholas Short, Jioh Kim, Paula Perdomo, Ricky Renfeng Mao, Yashvir Sabharwal, Michael Ahedor, Moaz Shoura, Adrian Nestor
A parameter-free clustering algorithm for missing datasets
Qi Li, Xianjun Zeng, Shuliang Wang, Wenhao Zhu, Shijie Ruan, Zhimeng Yuan
The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay Videos
Igor Cardoso, Rubens O. Moraes, Lucas N. Ferreira
A Repository for Formal Contexts
Tom Hanika, Robert Jäschke
A Dataset for Physical and Abstract Plausibility and Sources of Human Disagreement
Annerose Eichel, Sabine Schulte im Walde
Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning
Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation
Yin Li, Qi Chen, Kai Wang, Meige Li, Liping Si, Yingwei Guo, Yu Xiong, Qixing Wang, Yang Qin, Ling Xu, Patrick van der Smagt, Jun Tang, Nutan Chen