Data Set
Datasets are crucial for training and evaluating machine learning models, particularly in areas like natural language processing, computer vision, and audio analysis. Current research emphasizes creating diverse and high-quality datasets addressing specific challenges, such as data imbalance, cross-lingual inconsistencies, and the need for realistic representations of real-world scenarios. This involves developing novel annotation techniques, incorporating multiple data modalities (e.g., text, images, audio), and employing various model architectures (e.g., transformers, convolutional neural networks) for analysis and benchmark creation. The availability of well-designed datasets directly impacts the development of robust and reliable machine learning models, ultimately advancing scientific understanding and improving practical applications across numerous fields.
Papers
Birbal: An efficient 7B instruct-model fine-tuned with curated datasets
Ashvini Kumar Jindal, Pawan Kumar Rajpoot, Ankur Parikh
REAL-Colon: A dataset for developing real-world AI applications in colonoscopy
Carlo Biffi, Giulio Antonelli, Sebastian Bernhofer, Cesare Hassan, Daizen Hirata, Mineo Iwatate, Andreas Maieron, Pietro Salvagnini, Andrea Cherubini
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
Ander Salaberria, Gorka Azkune, Oier Lopez de Lacalle, Aitor Soroa, Eneko Agirre, Frank Keller
Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking
Nathan Gavenski, Michael Luck, Odinaldo Rodrigues
ARED: Argentina Real Estate Dataset
Iván Belenky
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, Qi Liu
Trained Random Forests Completely Reveal your Dataset
Julien Ferry, Ricardo Fukasawa, Timothée Pascal, Thibaut Vidal
DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments
Ji Ma, Hongming Dai, Yao Mu, Pengying Wu, Hao Wang, Xiaowei Chi, Yang Fei, Shanghang Zhang, Chang Liu
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents
Corby Rosset, Ho-Lam Chung, Guanghui Qin, Ethan C. Chau, Zhuo Feng, Ahmed Awadallah, Jennifer Neville, Nikhil Rao
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov
A Dataset for Metaphor Detection in Early Medieval Hebrew Poetry
Michael Toker, Oren Mishali, Ophir Münz-Manor, Benny Kimelfeld, Yonatan Belinkov
DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures
Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram
United We Pretrain, Divided We Fail! Representation Learning for Time Series by Pretraining on 75 Datasets at Once
Maurice Kraus, Felix Divo, David Steinmann, Devendra Singh Dhami, Kristian Kersting