Data Set
Datasets are the basis for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using model architectures such as transformers and convolutional neural networks to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, and therefore how well they serve practical applications.
Papers
MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets
Nassim Oufattole, Teya Bergamaschi, Aleksia Kolo, Hyewon Jeong, Hanna Gaggin, Collin M. Stultz, Matthew B.A. McDermott
Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking
Pranav Singh Chib, Pravendra Singh
AIDOVECL: AI-generated Dataset of Outpainted Vehicles for Eye-level Classification and Localization
Amir Kazemi, Qurat ul ain Fatima, Volodymyr Kindratenko, Christopher Tessum
GS-Blur: A 3D Scene-Based Dataset for Realistic Image Deblurring
Dongwoo Lee, Joonkyu Park, Kyoung Mu Lee
How Do Flow Matching Models Memorize and Generalize in Sample Data Subspaces?
Weiguo Gao, Ming Li
Emory Knee Radiograph (MRKR) Dataset
Brandon Price, Jason Adleberg, Kaesha Thomas, Zach Zaiman, Aawez Mansuri, Beatrice Brown-Mulry, Chima Okecheukwu, Judy Gichoya, Hari Trivedi
Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms
Jordan Meyer, Nick Padgett, Cullen Miller, Laura Exline
High-Fidelity Document Stain Removal via A Large-Scale Real-World Dataset and A Memory-Augmented Transformer
Mingxian Li, Hao Sun, Yingtie Lei, Xiaofeng Zhang, Yihang Dong, Yilin Zhou, Zimeng Li, Xuhang Chen
Analysis of Classifier Training on Synthetic Data for Cross-Domain Datasets
Andoni Cortés, Clemente Rodríguez, Gorka Velez, Javier Barandiarán, Marcos Nieto
Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset
Adrian Garret Gabriel, Alaa Alameer Ahmad, Shankar Kumar Jeyakumar
ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, Teruko Mitamura
Topic-Conversation Relevance (TCR) Dataset and Benchmarks
Yaran Fan, Jamie Pool, Senja Filipi, Ross Cutler
CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions
Spyridon Kantarelis, Konstantinos Thomas, Vassilis Lyberatos, Edmund Dervakos, Giorgos Stamou