Clean Data
Clean data is crucial for reliable machine learning, yet in many applications it is unavailable or compromised. Current research focuses on methods that purify noisy or poisoned datasets, often employing generative adversarial networks (GANs), diffusion models, or energy-based models to remove noise or to identify and correct corrupted data points without relying on a separate clean dataset. Such techniques are vital for improving the robustness and reliability of machine learning models across diverse fields, including bioacoustics, medical imaging, and cybersecurity, where access to perfectly clean data is often impractical or impossible. Effective data purification is therefore essential for advancing the trustworthiness and real-world applicability of machine learning.
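As a rough illustration of the energy-based purification idea explored in work such as PureEBM and PureGen, the sketch below runs a few Langevin dynamics steps under an energy model to nudge potentially poisoned inputs back toward the clean data manifold before training. This is a minimal, hypothetical example, not the authors' implementation: the EnergyNet architecture, the purify function, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class EnergyNet(nn.Module):
    """Toy scalar energy model E(x); a real defense would use a trained, much larger network."""

    def __init__(self, dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(dim, 256),
            nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


def purify(energy_model, x, steps=100, step_size=1e-2, noise_scale=1e-2):
    """Langevin dynamics: follow -grad E(x) with small Gaussian noise so that
    subtle poison perturbations are washed out while the image content is kept."""
    x = x.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        energy = energy_model(x).sum()
        (grad,) = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x = x - step_size * grad + noise_scale * torch.randn_like(x)
        x = x.detach().clamp(0.0, 1.0)  # keep pixels in a valid range
    return x


if __name__ == "__main__":
    model = EnergyNet()
    poisoned_batch = torch.rand(8, 3, 32, 32)  # stand-in for a possibly poisoned image batch
    cleaned_batch = purify(model, poisoned_batch)
    print(cleaned_batch.shape)  # torch.Size([8, 3, 32, 32])
```

The purified batch can then be fed to an ordinary training loop, which is what makes this family of defenses attractive: the downstream model and training procedure stay unchanged.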
Papers
PureEBM: Universal Poison Purification via Mid-Run Dynamics of Energy-Based Models
Omead Pooladzandi, Jeffrey Jiang, Sunay Bhat, Gregory Pottie
PureGen: Universal Data Purification for Train-Time Poison Defense via Generative Model Dynamics
Sunay Bhat, Jeffrey Jiang, Omead Pooladzandi, Alexander Branch, Gregory Pottie
Exact Recovery for System Identification with More Corrupt Data than Clean Data
Baturalp Yalcin, Haixiang Zhang, Javad Lavaei, Murat Arcak
Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks
Alon Jacovi, Avi Caciularu, Omer Goldman, Yoav Goldberg