Data Quality
Data quality, which encompasses the accuracy, completeness, consistency, and timeliness of data, is crucial for reliable machine learning performance and trustworthy AI applications. Current research focuses on automated methods for detecting and correcting data quality issues, including synthetic data generation, data augmentation, and the use of machine learning models themselves to refine datasets (e.g., using smaller models to improve larger ones). These efforts are driven by the need to improve the accuracy and robustness of AI systems across diverse fields, from social sciences and finance to healthcare and particle physics, where high-quality data is essential for reliable insights and decision-making.
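As a minimal illustration of the kind of automated checks such work builds on, the sketch below computes simple completeness, consistency, and timeliness indicators for a tabular dataset with pandas. The column names (age, income, updated_at) and the staleness threshold are hypothetical choices for this example and are not drawn from any of the papers listed below.

```python
# Minimal sketch of rule-based data quality checks (illustrative only).
# Assumes a pandas DataFrame with hypothetical columns "age", "income",
# and "updated_at"; thresholds are example values, not from any paper above.
import pandas as pd

def check_quality(df: pd.DataFrame, max_staleness_days: int = 30) -> dict:
    """Return simple completeness, consistency, and timeliness indicators."""
    report = {}
    # Completeness: fraction of missing values per column.
    report["missing_ratio"] = df.isna().mean().to_dict()
    # Consistency: duplicate rows and values outside plausible ranges.
    report["duplicate_rows"] = int(df.duplicated().sum())
    report["negative_income_rows"] = int((df["income"] < 0).sum())
    report["implausible_age_rows"] = int(((df["age"] < 0) | (df["age"] > 120)).sum())
    # Timeliness: records older than the allowed staleness window.
    record_age_days = (pd.Timestamp.now() - pd.to_datetime(df["updated_at"])).dt.days
    report["stale_rows"] = int((record_age_days > max_staleness_days).sum())
    return report

if __name__ == "__main__":
    df = pd.DataFrame({
        "age": [34, -1, 56, None],
        "income": [52000, 48000, -300, 61000],
        "updated_at": ["2024-05-01", "2024-05-02", "2023-01-15", "2024-05-03"],
    })
    print(check_quality(df))
```

In practice, hand-written rules like these are the baseline that the automated approaches above aim to replace or generate, for example by learning validation constraints from a pipeline's execution history.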
Papers
Data Quality in Imitation Learning
Suneel Belkhale, Yuchen Cui, Dorsa Sadigh
Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines
Dezhan Tu, Yeye He, Weiwei Cui, Song Ge, Haidong Zhang, Han Shi, Dongmei Zhang, Surajit Chaudhuri
Topological data quality via 0-dimensional persistence matching
Álvaro Torras-Casas, Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz