Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on improving the efficiency and effectiveness of training with diverse and heterogeneous corpora, including techniques like decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is crucial for advancing natural language processing, enabling the development of more robust, accurate, and versatile AI systems across various languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition
Rodolfo Zevallos, Luis Camacho, Nelsi Melgarejo
Building Korean Sign Language Augmentation (KoSLA) Corpus with Data Augmentation Technique
Changnam An, Eunkyung Han, Dongmyeong Noh, Ohkyoon Kwon, Sumi Lee, Hyunshim Han
Quote Erat Demonstrandum: A Web Interface for Exploring the Quotebank Corpus
Vuk Vuković, Akhil Arora, Huan-Cheng Chang, Andreas Spitz, Robert West
BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus
Josh Meyer, David Ifeoluwa Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack Julian Weber, Salomon Kabongo, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Emezue, Jonathan Mukiibi, Salomey Osei, Apelete Agbolo, Victor Akinode, Bernard Opoku, Samuel Olanrewaju, Jesujoba Alabi, Shamsuddeen Muhammad