Training Corpus
Training corpora are the large datasets used to train large language models (LLMs). Current research focuses on improving their quality, diversity, and suitability for specific tasks. This work includes methods for data selection and curation, such as techniques that leverage data influence scores, as well as methods that detect and mitigate data contamination and bias. How a training corpus is constructed largely determines how well and how reliably the resulting LLM performs, in domains ranging from scientific research to medical applications.
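As a concrete illustration of the contamination issue mentioned above, here is a minimal sketch of one common heuristic: dropping training documents whose word n-grams overlap heavily with benchmark text. It is not taken from any particular paper listed under this topic. The function names (`build_benchmark_index`, `overlap_fraction`, `drop_contaminated`), the whitespace tokenization, and the default values of `n` and `max_overlap` are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    # For texts shorter than n tokens the range is empty, giving an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_items: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[NGram] = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index


def overlap_fraction(doc: str, index: Set[NGram], n: int) -> float:
    """Fraction of the document's distinct n-grams that also occur in the index."""
    doc_grams = ngrams(doc, n)
    if not doc_grams:
        return 0.0
    return len(doc_grams & index) / len(doc_grams)


def drop_contaminated(
    corpus: Iterable[str],
    benchmark_items: Iterable[str],
    n: int = 8,               # assumed n-gram length, not a recommended value
    max_overlap: float = 0.1,  # assumed threshold, not a recommended value
) -> List[str]:
    """Keep only documents whose overlap with the benchmark is at most max_overlap."""
    index = build_benchmark_index(benchmark_items, n)
    return [doc for doc in corpus if overlap_fraction(doc, index, n) <= max_overlap]


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "completely unrelated text about training language models on curated web data",
    ]
    benchmark = ["quick brown fox jumps over the lazy dog near the river"]
    print(drop_contaminated(corpus, benchmark, n=3))
```

Exact n-gram matching of this kind misses paraphrased or translated benchmark items, so it is best read as a baseline filter rather than a complete contamination check.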