Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on making training with diverse, heterogeneous corpora more efficient and effective, using techniques such as decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is crucial for advancing natural language processing, enabling more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
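To make the "decoupled embeddings" idea concrete, the sketch below shows one common interpretation: each language keeps its own embedding table and output head while the transformer body is shared, so language-specific vocabularies interfere less with one another. This is a minimal illustration, not the method of any paper listed here; the class name, language set, vocabulary sizes, and model dimensions are all illustrative assumptions.

```python
# Minimal sketch of decoupled per-language embeddings around a shared encoder.
# All sizes and the language list are assumptions for illustration only.
import torch
import torch.nn as nn

class DecoupledEmbeddingLM(nn.Module):
    def __init__(self, vocab_sizes: dict, d_model: int = 256):
        super().__init__()
        # One embedding table per language, decoupled from the shared body.
        self.embeddings = nn.ModuleDict(
            {lang: nn.Embedding(size, d_model) for lang, size in vocab_sizes.items()}
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        # The transformer body is shared across all languages.
        self.body = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Per-language output heads matching each language-specific vocabulary.
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(d_model, size) for lang, size in vocab_sizes.items()}
        )

    def forward(self, token_ids: torch.Tensor, lang: str) -> torch.Tensor:
        hidden = self.body(self.embeddings[lang](token_ids))
        return self.heads[lang](hidden)

# Usage: route each batch through its language's embedding table and head.
model = DecoupledEmbeddingLM({"en": 8000, "de": 8000, "sw": 6000})
logits = model(torch.randint(0, 6000, (2, 16)), lang="sw")  # shape (2, 16, 6000)
```

Only the embedding tables and heads grow with the number of languages; the shared body keeps most parameters common, which is the trade-off this kind of decoupling is meant to balance.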
Papers
Entity Cloze By Date: What LMs Know About Unseen Entities
Yasumasa Onoe, Michael J. Q. Zhang, Eunsol Choi, Greg Durrett
CATs are Fuzzy PETs: A Corpus and Analysis of Potentially Euphemistic Terms
Martha Gavidia, Patrick Lee, Anna Feldman, Jing Peng
RaFoLa: A Rationale-Annotated Corpus for Detecting Indicators of Forced Labour
Erick Mendez Guzman, Viktor Schlegel, Riza Batista-Navarro
Balancing Multi-Domain Corpora Learning for Open-Domain Response Generation
Yujie Xing, Jinglun Cai, Nils Barlaug, Peng Liu, Jon Atle Gulla
MeSHup: A Corpus for Full Text Biomedical Document Indexing
Xindi Wang, Robert E. Mercer, Frank Rudzicz
Placing M-Phasis on the Plurality of Hate: A Feature-Based Corpus of Hate Online
Dana Ruiter, Liane Reiners, Ashwin Geet D'Sa, Thomas Kleinbauer, Dominique Fohr, Irina Illina, Dietrich Klakow, Christian Schemer, Angeliki Monnier