Data Curation

Data curation is the systematic collection, organization, and refinement of datasets to improve the performance and reliability of machine learning models. Current research emphasizes automated curation, often using large language models (LLMs) to improve data quality, address biases, and filter large-scale datasets efficiently, with methods such as embedding-based filtering and curriculum learning. This work matters for fields including natural language processing, computer vision, and biomedical research, where training robust and reliable AI systems depends on high-quality, unbiased datasets.
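As a rough illustration of what embedding-based filtering can look like, the sketch below drops near-duplicate items by comparing cosine similarity between embedding vectors. It is a minimal example under stated assumptions, not a method taken from any particular paper: the function names `cosine` and `filter_near_duplicates`, the threshold of 0.95, and the two-dimensional toy vectors are all hypothetical. In practice the vectors would come from an embedding model and the threshold would be tuned per dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_near_duplicates(embeddings, threshold=0.95):
    """Greedy filter: keep an item only if its similarity to every
    already-kept item is below `threshold`. Returns kept indices."""
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (hypothetical): item 1 is nearly identical to item 0.
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.7, 0.7],
]

kept_indices = filter_near_duplicates(toy_embeddings, threshold=0.95)
print(kept_indices)  # → [0, 2, 3]
```

The greedy pass compares each item against everything kept so far, so it is quadratic in the worst case; large-scale pipelines typically replace it with approximate nearest-neighbor search.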

Papers