Data Set
Datasets are crucial for training and evaluating machine learning models, particularly in areas like natural language processing, computer vision, and audio analysis. Current research emphasizes creating diverse and high-quality datasets addressing specific challenges, such as data imbalance, cross-lingual inconsistencies, and the need for realistic representations of real-world scenarios. This involves developing novel annotation techniques, incorporating multiple data modalities (e.g., text, images, audio), and employing various model architectures (e.g., transformers, convolutional neural networks) for analysis and benchmark creation. The availability of well-designed datasets directly impacts the development of robust and reliable machine learning models, ultimately advancing scientific understanding and improving practical applications across numerous fields.
Papers
AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection
Pia Pachinger, Janis Goldzycher, Anna Maria Planitzer, Wojciech Kusa, Allan Hanbury, Julia Neidhardt
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Edoardo Debenedetti, Javier Rando, Daniel Paleka, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, Giovanni Cherubin, Santiago Zanella-Beguelin, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramèr, Sahar Abdelnabi, Lea Schönherr
A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles
Nirmalya Thakur, Vanessa Su, Mingchen Shao, Kesha A. Patel, Hongseok Jeong, Victoria Knieling, Andrew Bian
MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition
Xin Jing, Luyang Zhang, Jiangjian Xie, Alexander Gebhard, Alice Baird, Bjoern Schuller
HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction
Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang
LLM Dataset Inference: Did you train on my dataset?
Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic
Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain
Brian Hu, Bill Ray, Alice Leung, Amy Summerville, David Joy, Christopher Funk, Arslan Basharat
WarCov -- Large multilabel and multimodal dataset from social platform
Weronika Borek-Marciniec, Pawel Zyblewski, Jakub Klikowski, Pawel Ksieniewicz
Building Bridges: A Dataset for Evaluating Gender-Fair Machine Translation into German
Manuel Lardelli, Giuseppe Attanasio, Anne Lauscher
Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision
Qian Ruan, Ilia Kuznetsov, Iryna Gurevych
Multilingual Text Style Transfer: Datasets & Models for Indian Languages
Sourabrata Mukherjee, Atul Kr. Ojha, Akanksha Bansal, Deepak Alok, John P. McCrae, Ondřej Dušek
Image captioning in different languages
Emiel van Miltenburg
FinGen: A Dataset for Argument Generation in Finance
Chung-Chi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao
Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach
Muhammad Saif Ullah Khan, Dhavalkumar Limbachiya, Didier Stricker, Muhammad Zeshan Afzal
LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising
Yuxing Duan, Shihan Peng, Lin Zhu, Wei Zhang, Yi Chang, Sheng Zhong, Luxin Yan