Data Set
Datasets are the basis for training and evaluating machine learning models, particularly in natural language processing, computer vision, and audio analysis. Current research emphasizes building diverse, high-quality datasets that address specific challenges, such as data imbalance, cross-lingual inconsistencies, and the need to represent real-world scenarios realistically. This work involves developing new annotation techniques, combining multiple data modalities (e.g., text, images, audio), and using model architectures such as transformers and convolutional neural networks to analyze the data and establish benchmarks. Well-designed datasets directly affect how robust and reliable the resulting models are, and in turn how well they serve scientific and practical applications.
Papers
Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking
Nathan Gavenski, Michael Luck, Odinaldo Rodrigues
ARED: Argentina Real Estate Dataset
Iván Belenky
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, Qi Liu
Trained Random Forests Completely Reveal your Dataset
Julien Ferry, Ricardo Fukasawa, Timothée Pascal, Thibaut Vidal
DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments
Ji Ma, Hongming Dai, Yao Mu, Pengying Wu, Hao Wang, Xiaowei Chi, Yang Fei, Shanghang Zhang, Chang Liu
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents
Corby Rosset, Ho-Lam Chung, Guanghui Qin, Ethan C. Chau, Zhuo Feng, Ahmed Awadallah, Jennifer Neville, Nikhil Rao
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov
A Dataset for Metaphor Detection in Early Medieval Hebrew Poetry
Michael Toker, Oren Mishali, Ophir Münz-Manor, Benny Kimelfeld, Yonatan Belinkov
DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures
Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram
United We Pretrain, Divided We Fail! Representation Learning for Time Series by Pretraining on 75 Datasets at Once
Maurice Kraus, Felix Divo, David Steinmann, Devendra Singh Dhami, Kristian Kersting
L+M-24: Building a Dataset for Language + Molecules @ ACL 2024
Carl Edwards, Qingyun Wang, Lawrence Zhao, Heng Ji
A Self-supervised Pressure Map human keypoint Detection Approch: Optimizing Generalization and Computational Efficiency Across Datasets
Chengzhang Yu, Xianjun Yang, Wenxia Bao, Shaonan Wang, Zhiming Yao
GDTM: An Indoor Geospatial Tracking Dataset with Distributed Multimodal Sensors
Ho Lyun Jeong, Ziqi Wang, Colin Samplawski, Jason Wu, Shiwei Fang, Lance M. Kaplan, Deepak Ganesan, Benjamin Marlin, Mani Srivastava
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models
Ashutosh Sathe, Prachi Jain, Sunayana Sitaram
DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing
Haneul Yoo, Jieun Han, So-Yeon Ahn, Alice Oh