Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on making training with diverse, heterogeneous corpora more efficient and effective, using techniques such as decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is central to advancing natural language processing, enabling more robust, accurate, and versatile systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
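To illustrate one of the techniques named above, the sketch below shows a minimal, hypothetical form of decoupled embeddings: each language gets its own lookup table projecting into a shared hidden space, so adding languages does not dilute a single shared vocabulary. This is an assumption-laden PyTorch sketch; the names (DecoupledEmbeddings, lang_vocab_sizes) are illustrative and not drawn from the papers listed here.

```python
import torch
import torch.nn as nn

class DecoupledEmbeddings(nn.Module):
    """One embedding table per language, projecting into a shared
    hidden space consumed by a single shared encoder body.
    Hypothetical sketch of per-language lexical decoupling."""

    def __init__(self, lang_vocab_sizes: dict[str, int], hidden_dim: int):
        super().__init__()
        # A separate table per language decouples lexical capacity,
        # mitigating the "curse of multilinguality" at the embedding layer.
        self.tables = nn.ModuleDict({
            lang: nn.Embedding(vocab_size, hidden_dim)
            for lang, vocab_size in lang_vocab_sizes.items()
        })

    def forward(self, token_ids: torch.Tensor, lang: str) -> torch.Tensor:
        # Route the batch through its language's table; the output
        # lives in the shared hidden space of the rest of the model.
        return self.tables[lang](token_ids)

# Usage: embed a batch of English token ids, then feed a shared encoder.
emb = DecoupledEmbeddings({"en": 32_000, "sw": 16_000}, hidden_dim=512)
x = emb(torch.randint(0, 32_000, (2, 8)), lang="en")  # shape (2, 8, 512)
```

The design choice here is that only the embedding layer scales with the number of languages, while the encoder body stays shared; actual published variants differ in where they draw that boundary.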
Papers
An Analysis of Negation in Natural Language Understanding Corpora
Md Mosharaf Hossain, Dhivya Chinnappa, Eduardo Blanco
C-MORE: Pretraining to Answer Open-Domain Questions by Consulting Millions of References
Xiang Yue, Xiaoman Pan, Wenlin Yao, Dian Yu, Dong Yu, Jianshu Chen
Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data
Shuohang Wang, Yichong Xu, Yuwei Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, Michael Zeng
Data Contamination: From Memorization to Exploitation
Inbal Magar, Roy Schwartz
Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models
Mark Chu, Bhargav Srinivasa Desikan, Ethan O. Nadler, D. Ruggiero Lo Sardo, Elise Darragh-Ford, Douglas Guilbeault