Large Corpus
Large corpora, i.e. massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on making training with diverse, heterogeneous corpora more efficient and effective, using techniques such as decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is crucial for advancing natural language processing: it enables more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
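As a minimal sketch of the mask-based data augmentation idea mentioned above, the snippet below randomly replaces tokens with a mask symbol to generate extra training variants for low-resource text. All names and parameters here (the function, the `[MASK]` token, the 15% rate) are illustrative assumptions, not taken from any of the listed papers:

```python
import random

def mask_augment(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return a copy of `tokens` with each token independently replaced
    by `mask_token` with probability `mask_prob` (hypothetical default)."""
    rng = random.Random(seed)  # seeded for reproducible augmentations
    return [mask_token if rng.random() < mask_prob else t for t in tokens]

# Generate several augmented variants of one sentence by varying the seed.
sentence = "low resource languages benefit from augmented data".split()
for s in range(3):
    print(" ".join(mask_augment(sentence, seed=s)))
```

In practice such corrupted copies are mixed into the training corpus so the model sees more surface variation per original sentence, which is one common way to stretch a small corpus.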
Papers
Synthetic Pre-Training Tasks for Neural Machine Translation
Zexue He, Graeme Blackwood, Rameswar Panda, Julian McAuley, Rogerio Feris
LENS: A Learnable Evaluation Metric for Text Simplification
Mounica Maddela, Yao Dou, David Heineman, Wei Xu
MANER: Mask Augmented Named Entity Recognition for Extreme Low-Resource Languages
Shashank Sonkar, Zichao Wang, Richard G. Baraniuk
Fine-tuning a Subtle Parsing Distinction Using a Probabilistic Decision Tree: the Case of Postnominal "that" in Noun Complement Clauses vs. Relative Clauses
Zineddine Tighidet, Nicolas Ballier
Video Games as a Corpus: Sentiment Analysis using Fallout New Vegas Dialog
Mika Hämäläinen, Khalid Alnajjar, Thierry Poibeau
Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer
Zhengbao Jiang, Luyu Gao, Jun Araki, Haibo Ding, Zhiruo Wang, Jamie Callan, Graham Neubig