Data Quality
Data quality covers the accuracy, completeness, consistency, and timeliness of data, and it is crucial for reliable machine learning model performance and trustworthy AI applications. Current research focuses on automated methods for detecting and correcting data quality issues, including synthetic data generation, data augmentation, and the use of machine learning models themselves to refine datasets (e.g., using smaller models to improve larger ones). These efforts aim to improve the accuracy and robustness of AI systems across diverse fields, from social sciences and finance to healthcare and particle physics, where reliable insights and decisions depend on high-quality data.
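As a rough illustration of how some of these dimensions might be turned into numbers, the sketch below scores a list of records for completeness, consistency, and timeliness. It is not taken from any of the papers listed here: the function name `quality_report`, its parameters, and the choice of exact-duplicate detection as a stand-in for consistency are all assumptions made for this example. Accuracy is left out because it generally requires ground-truth labels to measure.

```python
from datetime import datetime, timedelta


def quality_report(records, required_fields, timestamp_field, now, max_age):
    """Return illustrative completeness, consistency, and timeliness scores in [0, 1].

    All names and metric definitions here are hypothetical, chosen for the example.
    """
    total = len(records)
    if total == 0:
        return {"completeness": 0.0, "consistency": 0.0, "timeliness": 0.0}

    # Completeness: share of records where every required field is present and non-empty.
    complete = sum(
        1
        for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )

    # Consistency (simplified proxy): share of records that are not exact
    # duplicates of an earlier record.
    seen = set()
    unique = 0
    for r in records:
        key = tuple(sorted(r.items()))  # keys are unique, so values are never compared
        if key not in seen:
            seen.add(key)
            unique += 1

    # Timeliness: share of records whose timestamp is at most max_age older than now.
    fresh = sum(
        1
        for r in records
        if r.get(timestamp_field) is not None
        and now - r[timestamp_field] <= max_age
    )

    return {
        "completeness": complete / total,
        "consistency": unique / total,
        "timeliness": fresh / total,
    }
```

A call such as `quality_report(records, ["id", "text"], "ts", datetime.now(), timedelta(days=7))` returns one score per dimension. Real pipelines typically use richer checks (schema validation, cross-field rules, label-noise estimation) than these simple ratios.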
Papers
Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy
Min Zeng, Caiquan Liu, Shiqi Zhang, Li Xie, Chen Sang, Xiaoxin Chen, Xiaoxin Chen
Measuring Pre-training Data Quality without Labels for Time Series Foundation Models
Songkang Wen, Vasilii Feofanov, Jianfeng Zhang