Data Filtering

Data filtering aims to improve the quality and efficiency of machine learning by selectively removing or down-weighting less valuable data points in training datasets. Current research focuses on filtering techniques that leverage large language models (LLMs) and contrastive learning to identify and discard noisy, irrelevant, or low-quality examples, and that use metrics such as CLIP scores and quality estimation (QE) to assess data utility. Effective filtering improves the accuracy and robustness of machine learning models while reducing training costs, and is especially valuable in resource-constrained settings and in applications such as multilingual translation and question answering.
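
As a concrete illustration of metric-based filtering, the sketch below scores image-caption pairs with CLIP embedding similarity and keeps only pairs above a threshold. It is a minimal example, not a method from any specific paper: the `openai/clip-vit-base-patch32` checkpoint, the `filter_pairs` helper, and the threshold value are all illustrative choices, and real pipelines typically tune the threshold per dataset.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative cutoff; in practice this is tuned per dataset and checkpoint.
SCORE_THRESHOLD = 0.25

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()


def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings of one pair."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())


def filter_pairs(pairs):
    """Keep only (image, caption) pairs whose CLIP score clears the threshold."""
    return [(img, cap) for img, cap in pairs if clip_score(img, cap) >= SCORE_THRESHOLD]
```

The same keep-or-drop pattern applies to other utility signals (e.g. QE scores for translation pairs or LLM-assigned quality ratings); only the scoring function changes.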

Papers