Training Corpus

Training corpora are the large datasets used to train large language models (LLMs). Current research focuses on improving their quality, diversity, and suitability for specific tasks. This work includes methods for data selection and curation, such as techniques that use data influence scores, as well as methods that address data contamination and bias. Careful corpus construction is crucial for building high-performing and reliable LLMs, with consequences for fields ranging from scientific research to medical applications.
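As one illustration of what addressing data contamination can look like in practice, the sketch below drops training documents that share a word n-gram with any benchmark text. It is a minimal example under stated assumptions, not a method taken from any particular paper: the function names (`ngrams`, `is_contaminated`, `filter_corpus`), the whitespace tokenization, and the default n-gram length of 8 are all hypothetical choices made for illustration.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(doc, benchmark_grams, n):
    """True if doc shares at least one n-gram with the benchmark n-gram set."""
    return not ngrams(doc, n).isdisjoint(benchmark_grams)


def filter_corpus(corpus, benchmark_texts, n=8):
    """Keep only the documents that have no n-gram overlap with any benchmark text."""
    # Build the benchmark n-gram set once so each document is checked against it cheaply.
    benchmark_grams = set()
    for text in benchmark_texts:
        benchmark_grams |= ngrams(text, n)
    return [doc for doc in corpus if not is_contaminated(doc, benchmark_grams, n)]


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "training corpora need careful curation",
]
benchmark = ["A quick brown fox appears"]

# A short n is used here only because the toy documents are short.
clean = filter_corpus(corpus, benchmark, n=3)
```

In this toy run the first document shares the 3-gram "quick brown fox" with the benchmark text and is removed, so `clean` holds only the second document. Real pipelines typically use a proper tokenizer, longer n-grams, and approximate matching, all of which this sketch leaves out.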

Papers