Large Corpus
Large corpora — massive collections of text and other data — are fundamental to training advanced language models and other AI systems. Current research focuses on improving the efficiency and effectiveness of training with diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is crucial for advancing natural language processing: it enables more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
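Data augmentation for text corpora comes in many forms; as one illustrative sketch (not tied to any specific paper above), word dropout randomly removes tokens so a model sees noisier variants of the same sentence. The function name and parameters here are hypothetical, chosen for the example:

```python
import random

def word_dropout(text, p=0.1, seed=None):
    """Randomly drop whole words from a sentence -- a simple
    corpus-level augmentation that exposes a model to noisy input.

    p    : probability of dropping each word
    seed : optional seed for reproducible augmentation
    """
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    # Never return an empty string: fall back to the first word.
    return " ".join(kept) if kept else words[0]

# Usage: generate noisy variants of a training sentence.
variant = word_dropout(
    "large corpora are fundamental to training language models",
    p=0.2,
    seed=0,
)
```

In practice such augmentation is applied on the fly during training rather than by materializing an augmented copy of the corpus, which keeps storage costs flat while still diversifying the input distribution.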
Papers
A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing
Amir Zeldes, Nick Howell, Noam Ordan, Yifat Ben Moshe
Bringing NURC/SP to Digital Life: the Role of Open-source Automatic Speech Recognition Models
Lucas Rafael Stefanel Gris, Arnaldo Candido Junior, Vinícius G. dos Santos, Bruno A. Papa Dias, Marli Quadros Leite, Flaviane Romani Fernandes Svartman, Sandra Aluísio
Developing a general-purpose clinical language inference model from a large corpus of clinical notes
Madhumita Sushil, Dana Ludwig, Atul J. Butte, Vivek A. Rudrapatna
RedHOT: A Corpus of Annotated Medical Questions, Experiences, and Claims on Social Media
Somin Wadhwa, Vivek Khetan, Silvio Amir, Byron Wallace
Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts
Juntao Yu, Silviu Paun, Maris Camilleri, Paloma Carretero Garcia, Jon Chamberlain, Udo Kruschwitz, Massimo Poesio
HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea
Haneul Yoo, Jiho Jin, Juhee Son, JinYeong Bak, Kyunghyun Cho, Alice Oh