Training Corpus
Training corpora are the large datasets used to train large language models (LLMs). Current research focuses on improving their quality, diversity, and suitability for specific tasks, mainly through data selection and curation methods that use data influence scores and that address problems such as data contamination and bias. How a corpus is constructed strongly affects the performance and reliability of the resulting LLM, in applications ranging from scientific research to medicine.
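As one illustration of what a contamination check can look like, the sketch below flags training documents that share many word n-grams with a benchmark item. It is a minimal example under stated assumptions, not a method taken from any paper listed here: the function names (`ngrams`, `contamination_rate`, `filter_corpus`), the whitespace tokenization, the n-gram size, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_doc, benchmark_item, n=3):
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        # Benchmark item shorter than n tokens: nothing to compare.
        return 0.0
    return len(bench & ngrams(train_doc, n)) / len(bench)


def filter_corpus(corpus, benchmark, n=3, threshold=0.5):
    """Keep only documents whose overlap with every benchmark item is below threshold.

    The threshold value is an illustrative assumption, not a recommended setting.
    """
    return [
        doc for doc in corpus
        if all(contamination_rate(doc, item, n) < threshold for item in benchmark)
    ]


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "large language models need clean data",
    ]
    benchmark = ["quick brown fox jumps over"]
    for doc in filter_corpus(corpus, benchmark):
        print(doc)
```

Exact n-gram overlap only catches near-verbatim leakage; paraphrased or translated benchmark content would pass this filter, so in practice it is a first-pass screen rather than a complete safeguard.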
Papers
April 18, 2024
April 6, 2024
April 1, 2024
March 19, 2024
March 13, 2024
March 12, 2024
March 11, 2024
February 15, 2024
January 9, 2024
December 19, 2023
December 17, 2023
December 15, 2023
November 21, 2023
November 14, 2023
October 31, 2023
October 23, 2023
October 19, 2023
September 30, 2023