Data Set
Datasets are essential for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need for realistic representations of real-world scenarios. This work involves developing novel annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using model architectures such as transformers and convolutional neural networks to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn advances scientific understanding and improves practical applications across many fields.
Papers
Chat2Scenario: Scenario Extraction From Dataset Through Utilization of Large Language Model
Yongqi Zhao, Wenbo Xiao, Tomislav Mihalj, Jia Hu, Arno Eichberger
3D Human Pose Estimation with Occlusions: Introducing BlendMimic3D Dataset and GCN Refinement
Filipa Lino, Carlos Santiago, Manuel Marques
Rethinking Model Prototyping through the MedMNIST+ Dataset Collection
Sebastian Doerrich, Francesco Di Salvo, Julius Brockmann, Christian Ledig
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig
TeamTrack: A Dataset for Multi-Sport Multi-Object Tracking in Full-pitch Videos
Atom Scott, Ikuma Uchida, Ning Ding, Rikuhei Umemoto, Rory Bunker, Ren Kobayashi, Takeshi Koyama, Masaki Onishi, Yoshinari Kameda, Keisuke Fujii
Cross-cultural Inspiration Detection and Analysis in Real and LLM-generated Social Media Data
Oana Ignat, Gayathri Ganesh Lakshmy, Rada Mihalcea
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, Soyeon Caren Han
iTBLS: A Dataset of Interactive Conversations Over Tabular Information
Anirudh Sundar, Christopher Richardson, William Gay, Larry Heck
MathWriting: A Dataset For Handwritten Mathematical Expression Recognition
Philippe Gervais, Asya Fadeeva, Andrii Maksai
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
CNN-based explanation ensembling for dataset, representation and explanations evaluation
Weronika Hryniewska-Guzik, Luca Longo, Przemysław Biecek
AI Competitions and Benchmarks: Dataset Development
Romain Egele, Julio C. S. Jacques Junior, Jan N. van Rijn, Isabelle Guyon, Xavier Baró, Albert Clapés, Prasanna Balaprakash, Sergio Escalera, Thomas Moeslund, Jun Wan
RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization
Avinash Anand, Raj Jaiswal, Mohit Gupta, Siddhesh S Bangar, Pijush Bhuyan, Naman Lal, Rajeev Singh, Ritika Jha, Rajiv Ratn Shah, Shin'ichi Satoh