Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on improving the efficiency and effectiveness of training with diverse and heterogeneous corpora, including techniques like decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is crucial for advancing natural language processing, enabling the development of more robust, accurate, and versatile AI systems across various languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
Text-Augmented Open Knowledge Graph Completion via Pre-Trained Language Models
Pengcheng Jiang, Shivam Agarwal, Bowen Jin, Xuan Wang, Jimeng Sun, Jiawei Han
The ACL OCL Corpus: Advancing Open Science in Computational Linguistics
Shaurya Rohatgi, Yanxia Qin, Benjamin Aw, Niranjana Unnithan, Min-Yen Kan
Music Representing Corpus Virtual: An Open Sourced Library for Explorative Music Generation, Sound Design, and Instrument Creation with Artificial Intelligence and Machine Learning
Christopher Johann Clarke
ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games
Ruoyao Wang, Graham Todd, Eric Yuan, Ziang Xiao, Marc-Alexandre Côté, Peter Jansen
Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark
Feng Jiang, Weihao Liu, Xiaomin Chu, Peifeng Li, Qiaoming Zhu, Haizhou Li
Adapting Language Models to Compress Contexts
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen
CuRIAM: Corpus re Interpretation and Metalanguage in U.S. Supreme Court Opinions
Michael Kranzlein, Nathan Schneider, Kevin Tobia
Enabling Large Language Models to Generate Text with Citations
Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen
LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages
Milind Agarwal, Md Mahfuz Ibn Alam, Antonios Anastasopoulos
Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese
Rodrigo Wilkens, Leonardo Zilio, Aline Villavicencio
S\={a}mayik: A Benchmark and Dataset for English-Sanskrit Translation
Ayush Maheshwari, Ashim Gupta, Amrith Krishna, Atul Kumar Singh, Ganesh Ramakrishnan, G. Anil Kumar, Jitin Singla
Lifelong Language Pretraining with Distribution-Specialized Experts
Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cu
Patton: Language Model Pretraining on Text-Rich Networks
Bowen Jin, Wentao Zhang, Yu Zhang, Yu Meng, Xinyang Zhang, Qi Zhu, Jiawei Han
Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions?
Neeraj Varshney, Mihir Parmar, Nisarg Patel, Divij Handa, Sayantan Sarkar, Man Luo, Chitta Baral