Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on making training with diverse, heterogeneous corpora more efficient and effective, using techniques such as decoupled embeddings and data augmentation to mitigate problems like the "curse of multilinguality" and domain-specific biases. This work is central to advancing natural language processing, enabling more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
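The summary above mentions decoupled embeddings as one way to ease the curse of multilinguality. The sketch below is only an illustration of that general idea, not code from any of the listed papers: it assumes a PyTorch setup and a hypothetical DecoupledEmbeddingLM class in which each language gets its own embedding table and output head while a single Transformer backbone is shared; vocabulary sizes and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class DecoupledEmbeddingLM(nn.Module):
    """Sketch of decoupled embeddings: per-language embedding tables and
    output heads around a shared Transformer backbone (illustrative only)."""

    def __init__(self, vocab_sizes: dict, d_model: int = 256):
        super().__init__()
        # One embedding table per language; backbone parameters are shared.
        self.embeddings = nn.ModuleDict(
            {lang: nn.Embedding(v, d_model) for lang, v in vocab_sizes.items()}
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # One language-modeling head per language.
        self.lm_heads = nn.ModuleDict(
            {lang: nn.Linear(d_model, v) for lang, v in vocab_sizes.items()}
        )

    def forward(self, token_ids: torch.Tensor, lang: str) -> torch.Tensor:
        # Route a batch through its language-specific embedding and head;
        # gradients for the shared backbone accumulate across languages.
        hidden = self.backbone(self.embeddings[lang](token_ids))
        return self.lm_heads[lang](hidden)


# Toy usage: two "languages" with different vocabularies share one backbone.
model = DecoupledEmbeddingLM({"en": 1000, "sv": 800})
logits = model(torch.randint(0, 1000, (2, 16)), lang="en")
print(logits.shape)  # torch.Size([2, 16, 1000])
```

The design choice this illustrates is that only the embedding and output layers grow with the number of languages, so adding a language does not dilute the capacity of the shared backbone as severely as a single shared vocabulary would.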
Papers
Lifelong Language Pretraining with Distribution-Specialized Experts
Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui
Patton: Language Model Pretraining on Text-Rich Networks
Bowen Jin, Wentao Zhang, Yu Zhang, Yu Meng, Xinyang Zhang, Qi Zhu, Jiawei Han
Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions?
Neeraj Varshney, Mihir Parmar, Nisarg Patel, Divij Handa, Sayantan Sarkar, Man Luo, Chitta Baral
SweCTRL-Mini: a data-transparent Transformer-based large language model for controllable text generation in Swedish
Dmytro Kalpakchi, Johan Boye
BactInt: A domain driven transfer learning approach and a corpus for extracting inter-bacterial interactions from biomedical text
Krishanu Das Baksi, Vatsala Pokhrel, Kuntal Kumar Bhusan, Sharmila Mande