Web-Mined Corpus

Web-mined corpora are massive datasets of text and images scraped from the internet, and they are central to training large language models (LLMs) and other NLP systems. Current research focuses on improving data quality by addressing noise, redundancy, bias, and the inclusion of sensitive information, typically through heuristic, embedding-based, or classifier-based methods for pruning and filtering data. This curation work aims to improve LLM performance, efficiency, and ethical responsibility, with effects on downstream tasks ranging from code generation to cross-lingual applications. Developing benchmarks and standardized metrics for evaluating corpus quality is also an active area of work.
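As a rough illustration of the heuristic end of that spectrum, the sketch below combines two simple rules (a minimum word count and a cap on the share of non-alphanumeric characters) with exact deduplication over normalized text. The thresholds `MIN_WORDS` and `MAX_SYMBOL_RATIO` and the function names are assumptions chosen for this example, not values taken from any particular paper; embedding-based and classifier-based filters would replace `passes_heuristics` with a learned scoring step.

```python
import hashlib
import re

# Hypothetical thresholds, chosen only for illustration.
MIN_WORDS = 5
MAX_SYMBOL_RATIO = 0.3


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_heuristics(text: str) -> bool:
    """Reject very short documents and documents dominated by symbols."""
    if len(text.split()) < MIN_WORDS:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= MAX_SYMBOL_RATIO


def filter_corpus(docs):
    """Keep documents that pass the heuristics, dropping exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized text already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```

Exact hashing only catches documents that are identical after normalization; near-duplicate detection (for example MinHash) would be needed for the looser redundancy common in scraped data.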

Papers