Pre-Training Corpus

Pre-training corpora are the massive datasets on which large language models (LLMs) are initially trained, and their composition strongly shapes model capabilities. Current research focuses on improving corpus quality through automated methods such as neural web scraping and model-driven data refinement, aiming to reduce bias, harmful content, and data contamination while improving training efficiency. These efforts are crucial for building more reliable and robust LLMs: they address concerns about data quality and its ethical implications, and ultimately improve the performance and trustworthiness of downstream applications.
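As a concrete illustration of what such automated refinement can involve, the sketch below combines a crude heuristic quality filter with word-level n-gram decontamination against an evaluation set. All names, thresholds, and the 13-gram unit (a common choice in GPT-3-style decontamination) are illustrative assumptions, not a specific paper's pipeline.

```python
import re
from typing import Iterable, Iterator

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams, a common unit for contamination checks."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def passes_quality(doc: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Heuristic filter: drop very short or symbol-heavy documents.
    Thresholds are illustrative, not taken from any specific paper."""
    if len(doc.split()) < min_words:
        return False
    non_alnum = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return non_alnum / max(len(doc), 1) <= max_symbol_ratio

def decontaminate(corpus: Iterable[str], eval_texts: Iterable[str], n: int = 13) -> Iterator[str]:
    """Yield documents that pass the quality filter and share no
    n-gram with the held-out evaluation texts."""
    eval_ngrams: set[tuple[str, ...]] = set()
    for text in eval_texts:
        eval_ngrams |= ngrams(text, n)
    for doc in corpus:
        if passes_quality(doc) and not (ngrams(doc, n) & eval_ngrams):
            yield doc

# Hypothetical usage: raw_docs and benchmark_questions are assumed inputs.
# clean = list(decontaminate(raw_docs, benchmark_questions))
```

Production pipelines layer many more signals (model-based quality scores, perplexity filters, fuzzy deduplication), but the structure is the same: score each document, then drop anything that fails a threshold or overlaps held-out evaluation data.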

Papers