Training Corpus
Training corpora are the large datasets used to train large language models (LLMs). Current research focuses on improving their quality, diversity, and suitability for specific tasks. This involves developing methods for data selection and curation, including techniques that use data influence scores and techniques that address problems such as data contamination and bias. Because corpus construction directly affects how well and how reliably an LLM performs, it matters in fields ranging from scientific research to medical applications.
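As one illustration of the data contamination issue mentioned above, the following is a minimal sketch of an n-gram overlap check between training documents and benchmark items. It is not taken from any particular paper listed here; the function names `ngrams` and `contamination_rate`, the whitespace tokenization, and the default window size of 8 are all assumptions chosen for brevity.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    train_docs: Iterable[str],
    benchmark_items: Iterable[str],
    n: int = 8,  # assumed window size; real pipelines tune this
) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)

    items = list(benchmark_items)
    if not items:
        return 0.0

    # An item is flagged if any of its n-grams also appears in the corpus.
    flagged = sum(1 for item in items if ngrams(item, n) & train_ngrams)
    return flagged / len(items)
```

A check like this only catches verbatim overlap. Paraphrased or translated benchmark items would pass unnoticed, so it is better read as a lower bound on contamination than as a complete test.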