Data Curation
Data curation is the systematic collection, organization, and refinement of datasets to improve the performance and reliability of machine learning models. Current research emphasizes automated curation, using large language models (LLMs) to improve data quality, address biases, and filter large-scale datasets efficiently, often with methods such as embedding-based filtering and curriculum learning. This work supports fields including natural language processing, computer vision, and biomedical research, where training robust AI systems depends on high-quality, unbiased datasets.
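As a rough illustration of what embedding-based filtering can look like, the sketch below drops near-duplicate examples by comparing cosine similarity between embedding vectors. It is not taken from any of the papers listed here. The function names, the greedy keep-first strategy, the 0.95 threshold, and the toy two-dimensional vectors are all assumptions made for the example; a real pipeline would obtain embeddings from a trained encoder and would likely use an approximate nearest-neighbor index rather than pairwise comparison.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_near_duplicates(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Greedily keep an example only if it is not too similar to any kept one.

    Hypothetical helper for illustration: returns the indices of kept examples.
    The threshold is an assumed value, not a recommended setting.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (assumed data): the second vector nearly duplicates the first.
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.01],
    [0.0, 1.0],
    [0.7, 0.7],
]
kept_indices = filter_near_duplicates(toy_embeddings, threshold=0.95)
print(kept_indices)
```

The greedy pass is quadratic in the number of examples, which is acceptable for a small sample but would need an index structure or clustering step at web scale.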
Papers
The Evolution of LLM Adoption in Industry Data Curation Practices
Crystal Qian, Michael Xieyang Liu, Emily Reif, Grady Simon, Nada Hussein, Nathan Clement, James Wexler, Carrie J. Cai, Michael Terry, Minsuk Kahng
Efficient Curation of Invertebrate Image Datasets Using Feature Embeddings and Automatic Size Comparison
Mikko Impiö, Philipp M. Rehsen, Jenni Raitoharju
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
Fei Wang, Ninareh Mehrabi, Palash Goyal, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification
Benjamin Feuer, Jiawei Xu, Niv Cohen, Patrick Yubeaton, Govind Mittal, Chinmay Hegde
Data curation via joint example selection further accelerates multimodal learning
Talfan Evans, Nikhil Parthasarathy, Hamza Merzic, Olivier J. Henaff
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf