Data Set
Datasets are crucial for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges such as data imbalance, cross-lingual inconsistencies, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using model architectures such as transformers and convolutional neural networks to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, which in turn advances scientific understanding and improves practical applications across many fields.
Papers
ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, Teruko Mitamura
Topic-Conversation Relevance (TCR) Dataset and Benchmarks
Yaran Fan, Jamie Pool, Senja Filipi, Ross Cutler
CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions
Spyridon Kantarelis, Konstantinos Thomas, Vassilis Lyberatos, Edmund Dervakos, Giorgos Stamou
AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context
Naba Rizvi, Harper Strickland, Daniel Gitelman, Tristan Cooper, Alexis Morales-Flores, Michael Golden, Aekta Kallepalli, Akshat Alurkar, Haaset Owens, Saleha Ahmedi, Isha Khirwadkar, Imani Munyaka, Nedjma Ousidhoum
Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?
Lingao Xiao, Yang He
XAI-FUNGI: Dataset resulting from the user study on comprehensibility of explainable AI algorithms
Szymon Bobek, Paloma Korycińska, Monika Krakowska, Maciej Mozolewski, Dorota Rak, Magdalena Zych, Magdalena Wójcik, Grzegorz J. Nalepa
DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?
Urja Khurana, Eric Nalisnick, Antske Fokkens
Visual Motif Identification: Elaboration of a Curated Comparative Dataset and Classification Methods
Adam Phillips, Daniel Grandes Rodriguez, Miriam Sánchez-Manzano, Alan Salvadó, Manuel Garin, Gloria Haro, Coloma Ballester (Universitat Pompeu Fabra, Barcelona, Spain)
Designing a Dataset for Convolutional Neural Networks to Predict Space Groups Consistent with Extinction Laws
Hao Wang, Jiajun Zhong, Yikun Li, Junrong Zhang, Rong Du
A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu
RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs
Jiatan Huang, Mingchen Li, Zonghai Yao, Zhichao Yang, Yongkang Xiao, Feiyun Ouyang, Xiaohan Li, Shuo Han, Hong Yu
Comparing Surface Landmine Object Detection Models on a New Drone Flyby Dataset
Navin Agrawal-Chung, Zohran Moin
BQA: Body Language Question Answering Dataset for Video Large Language Models
Shintaro Ozaki, Kazuki Hayashi, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe