High Quality Data
High-quality data is crucial for training effective machine learning models, particularly large language models (LLMs) and multimodal models. Current research develops methods for creating, cleaning, and selecting datasets, including gamified crowdsourcing, counterfactual explanations for data augmentation, and filtering algorithms (e.g., ensemble KenLMs) that remove noise and bias. These efforts aim to improve model performance, robustness, and trustworthiness in applications ranging from autonomous driving to medical diagnosis, while addressing the challenges of imbalanced datasets and costly data annotation.
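To make the filtering idea concrete, here is a minimal sketch of ensemble-style perplexity filtering. It is an illustration under stated assumptions, not any paper's method. It assumes one language model is fit on trusted text and another on known-noisy text, and that a document is kept when it looks more like the former. A toy add-one-smoothed unigram model stands in for KenLM so the example runs without extra dependencies. The names `train_unigram`, `ensemble_score`, and `filter_documents`, the example corpora, and the threshold of 0.0 are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Count word frequencies over a list of documents (toy stand-in for an n-gram LM)."""
    counts = Counter(word for doc in corpus for word in doc.lower().split())
    return counts, sum(counts.values())


def perplexity(model, text):
    """Per-word perplexity under an add-one-smoothed unigram model."""
    counts, total = model
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words
    words = text.lower().split()
    if not words:
        return float("inf")
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)
    return math.exp(-log_prob / len(words))


def ensemble_score(text, clean_model, noisy_model):
    """Assumed scoring rule: negative means the text fits the clean model better."""
    return math.log(perplexity(clean_model, text)) - math.log(perplexity(noisy_model, text))


def filter_documents(docs, clean_model, noisy_model, threshold=0.0):
    """Keep documents whose ensemble score falls below the (hypothetical) threshold."""
    return [d for d in docs if ensemble_score(d, clean_model, noisy_model) < threshold]


# Tiny illustrative corpora; real systems would train on far larger reference sets.
clean_model = train_unigram([
    "the model learns from labeled data",
    "training data quality affects model accuracy",
])
noisy_model = train_unigram([
    "click here buy now free free free",
    "win cash now click click",
])

candidates = ["data quality affects the model", "click now free cash"]
kept = filter_documents(candidates, clean_model, noisy_model)
print(kept)
```

In practice the two scorers would be real n-gram models (for example, trained with KenLM), and the threshold would be tuned on held-out data rather than fixed at zero.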