Data Cleaning

Data cleaning improves the quality and reliability of datasets used in machine learning and other data-driven applications by identifying and correcting errors, inconsistencies, and redundancies. Current research emphasizes efficient, scalable methods, including neural networks, ensemble techniques, and large language models (LLMs), for tasks such as outlier detection, label correction, and missing-data imputation. These advances strengthen the performance and trustworthiness of machine learning models across diverse fields, from climate science and medicine to natural language processing and code generation, ultimately leading to more reliable and impactful applications.
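Two of the tasks named above, outlier detection and missing-data handling, can be illustrated with a minimal sketch. The function below is a hypothetical example, not drawn from any paper in this collection: it imputes missing values with the column median and flags outliers using a robust, MAD-based modified z-score (the 0.6745 factor rescales the MAD to be comparable to a standard deviation under normality).

```python
from statistics import median

def clean_column(values, z_thresh=3.5):
    # Hypothetical helper: median imputation plus robust
    # (median-absolute-deviation) outlier replacement.
    observed = [v for v in values if v is not None]
    med = median(observed)
    mad = median(abs(v - med) for v in observed)
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(med)  # impute missing value with the median
        elif mad > 0 and 0.6745 * abs(v - med) / mad > z_thresh:
            cleaned.append(med)  # replace extreme outlier with the median
        else:
            cleaned.append(v)
    return cleaned

# The corrupted reading (500.0) and the missing entry are both
# replaced by the column median.
print(clean_column([10.2, 9.8, None, 10.1, 500.0, 9.9]))
```

Median and MAD are used rather than mean and standard deviation because, on small samples, a single extreme value inflates the standard deviation enough to mask itself from a plain z-score test.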

Papers