Quality Corpus
High-quality corpora are crucial for training effective natural language processing (NLP) models, particularly large language models (LLMs). Current research focuses on building and improving these corpora through rigorous data cleaning, deduplication, and ensemble-based methods, often drawing on diverse sources such as literature, web data, and multilingual content to improve model performance and reduce bias. The availability of such corpora is vital for advancing NLP across languages and domains, with applications ranging from machine translation and text-to-speech to legal document processing and medical information retrieval.
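As a rough illustration of the cleaning and deduplication steps mentioned above, the following is a minimal sketch and not the pipeline of any particular paper. The function names `normalize` and `clean_and_deduplicate`, the `min_chars` threshold, and the choice of exact hash-based matching are all assumptions made for the example; production pipelines typically add near-duplicate detection and language or quality filtering on top.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization, collapse whitespace runs, strip the ends."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_and_deduplicate(docs, min_chars=20):
    """Return normalized documents, dropping short ones and exact duplicates.

    min_chars is a hypothetical threshold chosen for illustration only.
    """
    seen = set()
    kept = []
    for doc in docs:
        cleaned = normalize(doc)
        if len(cleaned) < min_chars:
            continue  # too short to be useful training text
        # Hash the case-folded text so copies differing only in case collapse.
        key = hashlib.sha256(cleaned.casefold().encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate of a document already kept
        seen.add(key)
        kept.append(cleaned)
    return kept
```

Exact hashing only catches copies that are identical after normalization; paraphrased or lightly edited duplicates would need a fuzzy method such as MinHash, which is outside the scope of this sketch.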