Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on making training with diverse, heterogeneous corpora more efficient and effective, using techniques such as decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is crucial for advancing natural language processing, enabling more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
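To make the "decoupled embeddings" idea concrete, here is a minimal sketch, not any cited paper's exact method: each language gets its own input embedding table and output head while the transformer body is shared, so adding languages does not force all vocabularies to compete for one embedding matrix. The class name `DecoupledEmbeddingLM` and all hyperparameters are illustrative assumptions; only the PyTorch APIs used are real.

```python
# Hypothetical sketch of decoupled per-language embeddings around a shared body.
# Names and sizes are illustrative assumptions, not a published architecture.
import torch
import torch.nn as nn

class DecoupledEmbeddingLM(nn.Module):
    def __init__(self, vocab_sizes: dict, d_model: int = 256):
        super().__init__()
        # One embedding table per language: the "decoupled" part.
        self.embeddings = nn.ModuleDict(
            {lang: nn.Embedding(v, d_model) for lang, v in vocab_sizes.items()}
        )
        # Transformer body shared across all languages.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        # Per-language output heads, mirroring the decoupled inputs.
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(d_model, v) for lang, v in vocab_sizes.items()}
        )

    def forward(self, token_ids: torch.Tensor, lang: str) -> torch.Tensor:
        h = self.embeddings[lang](token_ids)  # (batch, seq, d_model)
        h = self.body(h)                      # shared contextualization
        return self.heads[lang](h)            # per-language logits

# Usage: two toy languages with different vocabulary sizes.
model = DecoupledEmbeddingLM({"en": 1000, "de": 1200})
logits = model(torch.randint(0, 1000, (2, 16)), lang="en")
print(logits.shape)  # torch.Size([2, 16, 1000])
```

The design choice this sketch illustrates is capacity isolation: the shared body still transfers knowledge across languages, while the per-language tables keep one language's vocabulary from crowding out another's, which is the failure mode the "curse of multilinguality" describes.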
Papers
DeepLens: Interactive Out-of-distribution Data Detection in NLP Models
Da Song, Zhijie Wang, Yuheng Huang, Lei Ma, Tianyi Zhang
NLP Workbench: Efficient and Extensible Integration of State-of-the-art Text Mining Tools
Peiran Yao, Matej Kosmajac, Abeer Waheed, Kostyantyn Guzhva, Natalie Hervieux, Denilson Barbosa
Leveraging Large Text Corpora for End-to-End Speech Summarization
Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura