Automatic Curation
Automatic curation uses computational methods to organize, label, and enhance datasets, addressing the cost, time, and scalability limits of manual curation. Current research develops algorithms and models, including transformers, diffusion models, and clustering techniques, to automate data cleaning, annotation, and selection across text, images, and video. Applications range from biomedical research and scientific publishing to autonomous driving and public art curation, where automated curation aims to supply high-quality, readily accessible datasets for training and evaluation.
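As a rough illustration of the cleaning and selection tasks mentioned above, the following minimal sketch removes exact and near-duplicate text records. It is not the method of any paper listed here: the function names (`normalize`, `jaccard`, `curate`) and the `max_similarity` threshold are hypothetical, and simple token overlap stands in for the learned embeddings or clustering a real pipeline would more likely use.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def curate(records, max_similarity=0.8):
    """Keep records that are non-empty, not exact duplicates after
    normalization, and not too similar to an already kept record.

    max_similarity is an assumed, illustrative threshold.
    """
    kept = []          # original texts that survive curation
    kept_tokens = []   # token sets of the kept texts, same order
    seen = set()       # normalized forms already encountered
    for text in records:
        norm = normalize(text)
        if not norm or norm in seen:
            continue  # cleaning step: drop empty and exact-duplicate records
        seen.add(norm)
        tokens = set(re.findall(r"\w+", norm))
        if any(jaccard(tokens, t) >= max_similarity for t in kept_tokens):
            continue  # selection step: drop near-duplicates
        kept.append(text)
        kept_tokens.append(tokens)
    return kept
```

Called on a small list such as `["The cat sat on the mat.", "the cat  sat on the mat.", "Diffusion models generate images."]`, `curate` keeps the first and third records. The pairwise comparison is quadratic in the number of kept records, so this form only suits small collections.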
Papers
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais
CurateGPT: A flexible language-model assisted biocuration tool
Harry Caufield, Carlo Kroll, Shawn T O'Neil, Justin T Reese, Marcin P Joachimiak, Harshad Hegde, Nomi L Harris, Madan Krishnamurthy, James A McLaughlin, Damian Smedley, Melissa A Haendel, Peter N Robinson, Christopher J Mungall