Large Corpus
Large corpora are massive collections of text and other data that are fundamental to training advanced language models and other AI systems. Current research focuses on training more efficiently and effectively on diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific biases. This work supports more robust, accurate, and versatile natural language processing systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
CAW-coref: Conjunction-Aware Word-level Coreference Resolution
Karel D'Oosterlinck, Semere Kiros Bitew, Brandon Papineau, Christopher Potts, Thomas Demeester, Chris Develder
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions
Detai Xin, Junfeng Jiang, Shinnosuke Takamichi, Yuki Saito, Akiko Aizawa, Hiroshi Saruwatari
CCAE: A Corpus of Chinese-based Asian Englishes
Yang Liu, Melissa Xiaohui Qin, Long Wang, Chao Huang
A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks
Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang
Written and spoken corpus of real and fake social media postings about COVID-19
Ng Bee Chin, Ng Zhi Ee Nicole, Kyla Kwan, Lee Yong Han Dylann, Liu Fang, Xu Hong
Quantized Transformer Language Model Implementations on Edge Devices
Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik Satam, Soheil Salehi
The Cambridge Law Corpus: A Dataset for Legal AI Research
Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek
How-to Guides for Specific Audiences: A Corpus and Initial Findings
Nicola Fanton, Agnieszka Falenska, Michael Roth
Knowledge Sanitization of Large Language Models
Yoichi Ishibashi, Hidetoshi Shimodaira
Improve the efficiency of deep reinforcement learning through semantic exploration guided by natural language
Zhourui Guo, Meng Yao, Yang Yu, Qiyue Yin