Data Set
Datasets are crucial for training and evaluating machine learning models, particularly in areas like natural language processing, computer vision, and audio analysis. Current research emphasizes creating diverse and high-quality datasets addressing specific challenges, such as data imbalance, cross-lingual inconsistencies, and the need for realistic representations of real-world scenarios. This involves developing novel annotation techniques, incorporating multiple data modalities (e.g., text, images, audio), and employing various model architectures (e.g., transformers, convolutional neural networks) for analysis and benchmark creation. The availability of well-designed datasets directly impacts the development of robust and reliable machine learning models, ultimately advancing scientific understanding and improving practical applications across numerous fields.
Papers
WCLD: Curated Large Dataset of Criminal Cases from Wisconsin Circuit Courts
Elliott Ash, Naman Goel, Nianyun Li, Claudia Marangon, Peiyao Sun
EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images
Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei Ji, Eric I-Chao Chang, Tackeun Kim, Edward Choi
Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses
Elena Sizikova, Niloufar Saharkhiz, Diksha Sharma, Miguel Lago, Berkman Sahiner, Jana G. Delfino, Aldo Badano
SDOH-NLI: a Dataset for Inferring Social Determinants of Health from Clinical Notes
Adam D. Lelkes, Eric Loreaux, Tal Schuster, Ming-Jun Chen, Alvin Rajkomar
OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification
Dhiman Goswami, Md Nishat Raihan, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri
SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis
Md Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri
A Dataset of Relighted 3D Interacting Hands
Gyeongsik Moon, Shunsuke Saito, Weipeng Xu, Rohan Joshi, Julia Buffalini, Harley Bellan, Nicholas Rosen, Jesse Richardson, Mallorie Mize, Philippe de Bree, Tomas Simon, Bo Peng, Shubham Garg, Kevyn McPhail, Takaaki Shiratori
IndustReal: A Dataset for Procedure Step Recognition Handling Execution Errors in Egocentric Videos in an Industrial-Like Setting
Tim J. Schoonbeek, Tim Houben, Hans Onvlee, Peter H. N. de With, Fons van der Sommen
MLFMF: Data Sets for Machine Learning for Mathematical Formalization
Andrej Bauer, Matej Petković, Ljupčo Todorovski
NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes
Junda Wang, Zonghai Yao, Zhichao Yang, Huixue Zhou, Rumeng Li, Xun Wang, Yucheng Xu, Hong Yu
This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models
Iker García-Ferrero, Begoña Altuna, Javier Álvez, Itziar Gonzalez-Dios, German Rigau
VECHR: A Dataset for Explainable and Robust Classification of Vulnerability Type in the European Court of Human Rights
Shanshan Xu, Leon Staufer, T. Y. S. S Santosh, Oana Ichim, Corina Heri, Matthias Grabmair
A State-Vector Framework for Dataset Effects
Esmat Sahak, Zining Zhu, Frank Rudzicz
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Yangyang Guo, Fangkai Jiao, Zhiqi Shen, Liqiang Nie, Mohan Kankanhalli
Intent Detection and Slot Filling for Home Assistants: Dataset and Analysis for Bangla and Sylheti
Fardin Ahsan Sakib, A H M Rezaul Karim, Saadat Hasan Khan, Md Mushfiqur Rahman