Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on training more efficiently and effectively on diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate problems like the "curse of multilinguality" and domain-specific biases. This work supports more robust, accurate, and versatile natural language processing systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus
Deborah N. Jakobi, Thomas Kern, David R. Reich, Patrick Haller, Lena A. Jäger
EUROPA: A Legal Multilingual Keyphrase Generation Dataset
Olivier Salaün, Frédéric Piedboeuf, Guillaume Le Berre, David Alfonso Hermelo, Philippe Langlais
CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions
Leane Jourdan, Florian Boudin, Nicolas Hernandez, Richard Dufour
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs
Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, Soham Dan, Maxwell Crouse, Asim Munawar, Sadhana Kumaravel, Vinod Muthusamy, Pavan Kapanipathi, Luis A. Lastras
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Zijun Long, Xuri Ge, Richard McCreadie, Joemon Jose
IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
Honghao Gui, Lin Yuan, Hongbin Ye, Ningyu Zhang, Mengshu Sun, Lei Liang, Huajun Chen
Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark
Xiuying Chen, Tairan Wang, Qingqing Zhu, Taicheng Guo, Shen Gao, Zhiyong Lu, Xin Gao, Xiangliang Zhang
From Text to CQL: Bridging Natural Language and Corpus Search Engine
Luming Lu, Jiyuan An, Yujie Wang, Liner Yang, Cunliang Kong, Zhenghao Liu, Shuo Wang, Haozhe Lin, Mingwei Fang, Yaping Huang, Erhong Yang
STENCIL: Submodular Mutual Information Based Weak Supervision for Cold-Start Active Learning
Nathan Beck, Adithya Iyer, Rishabh Iyer