Data Set
Datasets are crucial for training and evaluating machine learning models, particularly in areas like natural language processing, computer vision, and audio analysis. Current research emphasizes creating diverse and high-quality datasets addressing specific challenges, such as data imbalance, cross-lingual inconsistencies, and the need for realistic representations of real-world scenarios. This involves developing novel annotation techniques, incorporating multiple data modalities (e.g., text, images, audio), and employing various model architectures (e.g., transformers, convolutional neural networks) for analysis and benchmark creation. The availability of well-designed datasets directly impacts the development of robust and reliable machine learning models, ultimately advancing scientific understanding and improving practical applications across numerous fields.
Papers
Larth: Dataset and Machine Translation for Etruscan
Gianluca Vico, Gerasimos Spanakis
InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations
Nils Feldhus, Qianli Wang, Tatiana Anikina, Sahil Chopra, Cennet Oguz, Sebastian Möller
Divide and Ensemble: Progressively Learning for the Unknown
Hu Zhang, Xin Shen, Heming Du, Huiqiang Chen, Chen Liu, Hongwei Sheng, Qingzheng Xu, MD Wahiduzzaman Khan, Qingtao Yu, Tianqing Zhu, Scott Chapman, Zi Huang, Xin Yu
MEMO: Dataset and Methods for Robust Multimodal Retinal Image Registration with Large or Small Vessel Density Differences
Chiao-Yi Wang, Faranguisse Kakhi Sadrieh, Yi-Ting Shen, Shih-En Chen, Sarah Kim, Victoria Chen, Achyut Raghavendra, Dongyi Wang, Osamah Saeedi, Yang Tao
Species196: A One-Million Semi-supervised Dataset for Fine-grained Species Recognition
Wei He, Kai Han, Ying Nie, Chengcheng Wang, Yunhe Wang
DIOR: Dataset for Indoor-Outdoor Reidentification -- Long Range 3D/2D Skeleton Gait Collection Pipeline, Semi-Automated Gait Keypoint Labeling and Baseline Evaluation Methods
Yuyang Chen, Praveen Raj Masilamani, Bhavin Jawade, Srirangaraj Setlur, Karthik Dantu
The Cambridge Law Corpus: A Dataset for Legal AI Research
Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek
Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches
Deepak Gupta, Kush Attal, Dina Demner-Fushman
Choice-75: A Dataset on Decision Branching in Script Learning
Zhaoyi Joey Hou, Li Zhang, Chris Callison-Burch