Corpus Training
Corpus training focuses on optimizing the datasets used to train language models, aiming to improve model performance and generalization capabilities. Current research emphasizes mitigating issues like data contamination and exploring efficient training strategies, including adaptive multi-corpora training and methods leveraging limited labeled data (e.g., extremely weak supervision). These advancements are crucial for enhancing the accuracy and robustness of language models across various NLP tasks, particularly in low-resource settings and domains where data scarcity is a significant challenge.
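Contamination mitigation, mentioned above, is commonly approximated with n-gram overlap filtering between the training corpus and held-out evaluation sets. A minimal sketch of that idea follows; the function names, the 13-gram window, and the zero-overlap threshold are illustrative assumptions, not any specific paper's method:

```python
def ngrams(text, n=13):
    """Return the set of word-level n-grams in a document."""
    tokens = text.lower().split()
    return {" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(train_docs, eval_docs, n=13, threshold=0.0):
    """Flag training documents whose n-gram overlap with the eval set
    exceeds `threshold` (fraction of the document's own n-grams)."""
    eval_grams = set()
    for doc in eval_docs:
        eval_grams |= ngrams(doc, n)
    flagged = []
    for i, doc in enumerate(train_docs):
        grams = ngrams(doc, n)
        if grams and len(grams & eval_grams) / len(grams) > threshold:
            flagged.append(i)
    return flagged
```

Flagged documents would then be removed or down-weighted before training; production pipelines typically use hashed n-grams or suffix arrays for scale, but the filtering logic is the same.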