Large Corpus
Large corpora, i.e., massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on training more efficiently and effectively with diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is crucial for advancing natural language processing: it enables more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
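As a rough illustration of one technique mentioned above, data augmentation for text corpora can be as simple as randomly perturbing training examples to increase effective corpus diversity. The sketch below is purely illustrative, not taken from any of the listed papers; the function name and parameters are hypothetical, and real augmentation pipelines (e.g., back-translation or synonym replacement) are considerably more sophisticated.

```python
import random

def augment_by_deletion(tokens, p_delete=0.1, seed=0):
    """Illustrative token-deletion augmentation (hypothetical helper).

    Randomly drops each token with probability p_delete to produce a
    perturbed copy of a training example. A fixed seed keeps the
    example reproducible.
    """
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p_delete]
    # Never emit an empty example; fall back to the original tokens.
    return kept if kept else list(tokens)

# Usage: create a perturbed variant of one sentence from a corpus.
sentence = ["large", "corpora", "improve", "language", "models"]
variant = augment_by_deletion(sentence, p_delete=0.3)
```

In practice such perturbations are applied on the fly during training, so each epoch sees slightly different versions of the same underlying corpus.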
Papers
How-to Guides for Specific Audiences: A Corpus and Initial Findings
Nicola Fanton, Agnieszka Falenska, Michael Roth
Knowledge Sanitization of Large Language Models
Yoichi Ishibashi, Hidetoshi Shimodaira
Improve the efficiency of deep reinforcement learning through semantic exploration guided by natural language
Zhourui Guo, Meng Yao, Yang Yu, Qiyue Yin
AlbNER: A Corpus for Named Entity Recognition in Albanian
Erion Çano
Generating Semantic Graph Corpora with Graph Expansion Grammar
Eric Andersson, Johanna Björklund, Frank Drewes, Anna Jonsson
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey
Presenting the SWTC: A Symbolic Corpus of Themes from John Williams' Star Wars Episodes I-IX
Claire Arthur, Frank Lehman, John McNamara
HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models
Guijin Son, Hanwool Lee, Suwan Kim, Huiseo Kim, Jaecheol Lee, Je Won Yeom, Jihyu Jung, Jung Woo Kim, Songseong Kim