Diverse Corpus
Diverse corpora are collections of text and other data drawn from varied sources and languages, and they are crucial for training robust, generalizable natural language processing (NLP) models. Current research focuses on building and evaluating such corpora, particularly for under-resourced languages, and on training techniques that exploit their heterogeneity, such as source prompting and retrieval augmentation. This work matters because it addresses the biases inherent in homogeneous datasets and enables NLP tools that work across diverse linguistic and cultural contexts, with applications ranging from machine translation to mental health detection.