Large Corpus
Large corpora, i.e. massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on making training with diverse, heterogeneous corpora more efficient and effective, using techniques such as decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases; a minimal sketch of the decoupled-embedding idea appears below. This work is central to advancing natural language processing, supporting more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
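To make the "decoupled embeddings" idea concrete, the following sketch (not taken from any of the papers listed here; all class and parameter names are illustrative) gives each language its own embedding table and output head while sharing one transformer body, so adding languages does not force every language to compete for a single shared vocabulary.

```python
# Illustrative sketch of decoupled per-language embeddings with a shared encoder.
# Vocabulary sizes and model dimensions below are arbitrary placeholders.
import torch
import torch.nn as nn

class DecoupledEmbeddingLM(nn.Module):
    def __init__(self, vocab_sizes: dict, d_model: int = 256,
                 n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        # One embedding table per language (decoupled input representations).
        self.embeddings = nn.ModuleDict({
            lang: nn.Embedding(size, d_model) for lang, size in vocab_sizes.items()
        })
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # A single transformer body shared across all languages.
        self.shared_encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        # Matching per-language output heads.
        self.heads = nn.ModuleDict({
            lang: nn.Linear(d_model, size) for lang, size in vocab_sizes.items()
        })

    def forward(self, token_ids: torch.Tensor, lang: str) -> torch.Tensor:
        x = self.embeddings[lang](token_ids)  # language-specific lookup
        h = self.shared_encoder(x)            # shared cross-lingual computation
        return self.heads[lang](h)            # language-specific prediction

# Usage with toy vocabulary sizes:
model = DecoupledEmbeddingLM({"en": 1000, "fi": 800})
logits = model(torch.randint(0, 800, (2, 16)), lang="fi")
print(logits.shape)  # torch.Size([2, 16, 800])
```

The design choice being illustrated is simply that per-language parameters grow only in the embedding and output layers, while cross-lingual capacity stays concentrated in the shared encoder.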
Papers
Polling Latent Opinions: A Method for Computational Sociolinguistics Using Transformer Language Models
Philip Feldman, Aaron Dant, James R. Foulds, Shimei Pan
Automated speech tools for helping communities process restricted-access corpora for language revival efforts
Nay San, Martijn Bartelds, Tolúlopé Ògúnrèmí, Alison Mount, Ruben Thompson, Michael Higgins, Roy Barker, Jane Simpson, Dan Jurafsky
Linking Emergent and Natural Languages via Corpus Transfer
Shunyu Yao, Mo Yu, Yang Zhang, Karthik R Narasimhan, Joshua B. Tenenbaum, Chuang Gan
Kratt: Developing an Automatic Subject Indexing Tool for The National Library of Estonia
Marit Asula, Jane Makke, Linda Freienthal, Hele-Andra Kuulmets, Raul Sirel
Lahjoita puhetta -- a large-scale corpus of spoken Finnish with some benchmarks
Anssi Moisio, Dejan Porjazovski, Aku Rouhe, Yaroslav Getman, Anja Virkkunen, Tamás Grósz, Krister Lindén, Mikko Kurimo