Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on training more efficiently and effectively on diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific biases. This work supports the development of more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
516 papers
March 20, 2024
A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, and 5 others
How Gender Interacts with Political Values: A Case Study on Czech BERT Models
Adnan Al Ali, Jindřich Libovický
March 15, 2024
March 13, 2024
Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages
Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models
Ning Ding, Yulin Chen, Ganqu Cui, Xingtai Lv, Weilin Zhao, Ruobing Xie, Bowen Zhou, Zhiyuan Liu, Maosong Sun
Validating and Exploring Large Geographic Corpora
Jonathan Dunn
March 8, 2024
March 4, 2024
March 1, 2024
PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus
Deborah N. Jakobi, Thomas Kern, David R. Reich, Patrick Haller, Lena A. Jäger
EUROPA: A Legal Multilingual Keyphrase Generation Dataset
Olivier Salaün, Frédéric Piedboeuf, Guillaume Le Berre, David Alfonso Hermelo, Philippe Langlais
CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions
Leane Jourdan, Florian Boudin, Nicolas Hernandez, Richard Dufour