Large Corpus
Large corpora are massive collections of text and other data used to train language models and other AI systems. Current research focuses on making training on diverse, heterogeneous corpora more efficient and effective, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific bias. This work supports more robust, accurate, and versatile systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
STONYBOOK: A System and Resource for Large-Scale Analysis of Novels
Charuta Pethe, Allen Kim, Rajesh Prabhakar, Tanzir Pial, Steven Skiena
Spoken Dialogue System for Medical Prescription Acquisition on Smartphone: Development, Corpus and Evaluation
Ali Can Kocabiyikoglu, François Portet, Jean-Marc Babouchkine, Prudence Gibert, Hervé Blanchon, Gaëtan Gavazzi
On the effect of curriculum learning with developmental data for grammar acquisition
Mattia Opper, J. Morrison, N. Siddharth
What's In My Big Data?
Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge
ChiSCor: A Corpus of Freely Told Fantasy Stories by Dutch Children for Computational Linguistics and Cognitive Science
Bram M. A. van Dijk, Max J. van Duijn, Suzan Verberne, Marco R. Spruit
Learning to Play Chess from Textbooks (LEAP): a Corpus for Evaluating Chess Moves based on Sentiment Analysis
Haifa Alrdahi, Riza Batista-Navarro
Machine Translation for Nko: Tools, Corpora and Baseline Results
Moussa Koulako Bala Doumbouya, Baba Mamadi Diané, Solo Farabado Cissé, Djibrila Diané, Abdoulaye Sow, Séré Moussa Doumbouya, Daouda Bangoura, Fodé Moriba Bayo, Ibrahima Sory 2. Condé, Kalo Mory Diané, Chris Piech, Christopher Manning
Interpreting Answers to Yes-No Questions in User-Generated Content
Shivam Mathur, Keun Hee Park, Dhivya Chinnappa, Saketh Kotamraju, Eduardo Blanco