Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on training more efficiently and effectively on diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific biases. This work advances natural language processing by enabling more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
Machine Translation for Nko: Tools, Corpora and Baseline Results
Moussa Koulako Bala Doumbouya, Baba Mamadi Diané, Solo Farabado Cissé, Djibrila Diané, Abdoulaye Sow, Séré Moussa Doumbouya, Daouda Bangoura, Fodé Moriba Bayo, Ibrahima Sory 2. Condé, Kalo Mory Diané, Chris Piech, Christopher Manning
Interpreting Answers to Yes-No Questions in User-Generated Content
Shivam Mathur, Keun Hee Park, Dhivya Chinnappa, Saketh Kotamraju, Eduardo Blanco
CAW-coref: Conjunction-Aware Word-level Coreference Resolution
Karel D'Oosterlinck, Semere Kiros Bitew, Brandon Papineau, Christopher Potts, Thomas Demeester, Chris Develder
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions
Detai Xin, Junfeng Jiang, Shinnosuke Takamichi, Yuki Saito, Akiko Aizawa, Hiroshi Saruwatari
CCAE: A Corpus of Chinese-based Asian Englishes
Yang Liu, Melissa Xiaohui Qin, Long Wang, Chao Huang
A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks
Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang
Written and spoken corpus of real and fake social media postings about COVID-19
Ng Bee Chin, Ng Zhi Ee Nicole, Kyla Kwan, Lee Yong Han Dylann, Liu Fang, Xu Hong
Quantized Transformer Language Model Implementations on Edge Devices
Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik Satam, Soheil Salehi