Corpus Training

Corpus training focuses on optimizing the datasets used to train language models in order to improve performance and generalization. Current research emphasizes mitigating data contamination and developing efficient training strategies, including adaptive multi-corpora training and methods that leverage very limited labeled data (e.g., extremely weak supervision). These advances are most valuable for improving the accuracy and robustness of language models across NLP tasks in low-resource settings and domains where labeled data is scarce.
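
As a rough illustration of what "adaptive multi-corpora training" can mean in practice, the sketch below shifts sampling probability toward corpora with higher recent training loss. The class name, the softmax-over-loss reweighting rule, and the moving-average update are illustrative assumptions, not the method of any particular paper listed here.

```python
import math
import random


class AdaptiveCorpusSampler:
    """Hypothetical sketch: sample training examples from several corpora,
    shifting probability mass toward corpora with higher recent loss
    (i.e., data the model has learned least well so far)."""

    def __init__(self, corpora, temperature=1.0):
        self.corpora = corpora                              # dict: name -> list of examples
        self.temperature = temperature                      # controls how sharply weights react to loss
        self.recent_loss = {name: 1.0 for name in corpora}  # running loss estimate per corpus

    def weights(self):
        # Softmax over recent per-corpus losses: harder corpora are sampled more often.
        scores = {n: math.exp(l / self.temperature) for n, l in self.recent_loss.items()}
        total = sum(scores.values())
        return {n: s / total for n, s in scores.items()}

    def sample_batch(self, batch_size):
        w = self.weights()
        names = list(w.keys())
        probs = [w[n] for n in names]
        batch = []
        for _ in range(batch_size):
            corpus = random.choices(names, weights=probs, k=1)[0]
            batch.append((corpus, random.choice(self.corpora[corpus])))
        return batch

    def update(self, corpus_name, loss, momentum=0.9):
        # Exponential moving average of the observed loss for this corpus.
        prev = self.recent_loss[corpus_name]
        self.recent_loss[corpus_name] = momentum * prev + (1 - momentum) * loss
```

In use, a training loop would call `sample_batch`, compute the loss per example, and feed the observed per-corpus losses back through `update`, so the sampling distribution adapts as some corpora are learned faster than others.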

Papers