Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on making training on diverse, heterogeneous corpora more efficient and effective, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific biases. This work underpins progress in natural language processing by enabling more robust, accurate, and versatile systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages
Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models
Ning Ding, Yulin Chen, Ganqu Cui, Xingtai Lv, Weilin Zhao, Ruobing Xie, Bowen Zhou, Zhiyuan Liu, Maosong Sun
Validating and Exploring Large Geographic Corpora
Jonathan Dunn
PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus
Deborah N. Jakobi, Thomas Kern, David R. Reich, Patrick Haller, Lena A. Jäger
EUROPA: A Legal Multilingual Keyphrase Generation Dataset
Olivier Salaün, Frédéric Piedboeuf, Guillaume Le Berre, David Alfonso Hermelo, Philippe Langlais
CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions
Leane Jourdan, Florian Boudin, Nicolas Hernandez, Richard Dufour
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs
Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, Soham Dan, Maxwell Crouse, Asim Munawar, Sadhana Kumaravel, Vinod Muthusamy, Pavan Kapanipathi, Luis A. Lastras
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Zijun Long, Xuri Ge, Richard Mccreadie, Joemon Jose
IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
Honghao Gui, Lin Yuan, Hongbin Ye, Ningyu Zhang, Mengshu Sun, Lei Liang, Huajun Chen
Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark
Xiuying Chen, Tairan Wang, Qingqing Zhu, Taicheng Guo, Shen Gao, Zhiyong Lu, Xin Gao, Xiangliang Zhang