Pre-Training Corpus
Pre-training corpora are the massive datasets used to initially train large language models (LLMs), significantly impacting their capabilities. Current research focuses on improving corpus quality through automated methods like neural web scraping and model-driven data refinement, aiming to reduce biases, harmful content, and data contamination while enhancing efficiency. These efforts are crucial for building more reliable and robust LLMs, addressing concerns about data quality and ethical implications, and ultimately improving the performance and trustworthiness of downstream applications.
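The refinement pipelines described above typically combine heuristic quality filtering, deduplication, and benchmark-contamination checks. A minimal sketch of these three stages follows; the thresholds, n-gram size, and function names are illustrative assumptions, not values from any particular paper.

```python
import hashlib
import re

def quality_filter(doc: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Keep documents that pass simple heuristic quality checks
    (minimum length, limited non-alphanumeric noise)."""
    words = doc.split()
    if len(words) < min_words:
        return False
    # Reject documents dominated by symbols/markup debris.
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(doc), 1) <= max_symbol_ratio

def dedup(docs):
    """Drop exact duplicates using a content hash; real pipelines
    often use fuzzy methods such as MinHash instead."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

def contaminated(doc: str, benchmark_ngrams: set, n: int = 8) -> bool:
    """Flag documents sharing any n-gram with a held-out evaluation set,
    a common proxy check for train/test contamination."""
    tokens = re.findall(r"\w+", doc.lower())
    ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return bool(ngrams & benchmark_ngrams)
```

In practice these stages run at web scale with distributed infrastructure and learned (model-driven) filters rather than hand-written heuristics, but the overall structure is the same: score, filter, deduplicate, and screen against evaluation data before training.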