Web-Mined Corpus
Web-mined corpora, massive datasets of text and images scraped from the internet, are crucial for training large language models (LLMs) and other NLP applications. Current research emphasizes improving data quality by addressing issues like noise, redundancy, bias, and the inclusion of sensitive information, often employing heuristic, embedding-based, or classifier-based methods for data pruning and filtering. This focus on data curation aims to enhance LLM performance, efficiency, and ethical responsibility, impacting various downstream tasks from code generation to cross-lingual applications. The development of benchmarks and standardized metrics for evaluating corpus quality is also a significant area of ongoing work.
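To make the heuristic side of this concrete, below is a minimal sketch of rule-based quality filtering and exact deduplication for scraped documents. It is an illustration only, not the pipeline of any particular paper; the function names, thresholds, and toy corpus are all hypothetical choices for this example.

```python
import hashlib
import re

def heuristic_quality_filter(doc: str,
                             min_words: int = 50,
                             max_symbol_ratio: float = 0.1) -> bool:
    """Return True if a scraped document passes simple quality heuristics.
    Thresholds here are illustrative, not taken from any specific paper."""
    words = doc.split()
    if len(words) < min_words:
        return False  # too short to be useful training text
    # Excessive non-alphanumeric characters often indicate leftover markup.
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    if symbols / max(len(doc), 1) > max_symbol_ratio:
        return False
    # Pages dominated by a few repeated lines are usually navigation or spam.
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    if lines and len(set(lines)) / len(lines) < 0.3:
        return False
    return True

def deduplicate(docs):
    """Drop exact duplicates using a hash of whitespace-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(re.sub(r"\s+", " ", doc.lower()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

# Toy corpus: one page of running text and one page of navigation boilerplate.
corpus = [
    "This is an example of a scraped web page with enough running text to pass. " * 10,
    "menu\nmenu\nmenu\nlogin",
]
cleaned = [d for d in deduplicate(corpus) if heuristic_quality_filter(d)]
print(f"kept {len(cleaned)} of {len(corpus)} documents")
```

Embedding- and classifier-based approaches replace or augment such hand-written rules with a learned quality score, for example a model trained to distinguish curated reference text from raw crawl data, which is then used to prune or re-weight the corpus.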