Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on improving the efficiency and effectiveness of training with diverse and heterogeneous corpora, including techniques like decoupled embeddings and data augmentation to mitigate issues like the "curse of multilinguality" and domain-specific biases. This work is crucial for advancing natural language processing, enabling the development of more robust, accurate, and versatile AI systems across various languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese
Sidney Evaldo Leal, Magali Sanches Duran, Carolina Evaristo Scarton, Nathan Siegle Hartmann, Sandra Maria Aluísio
JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification
Shinnosuke Takamichi, Ludwig Kürzinger, Takaaki Saeki, Sayaka Shiota, Shinji Watanabe
Penn-Helsinki Parsed Corpus of Early Modern English: First Parsing Results and Analysis
Seth Kulick, Neville Ryant, Beatrice Santorini
KGR^4: Retrieval, Retrospect, Refine and Rethink for Commonsense Generation
Xin Liu, Dayiheng Liu, Baosong Yang, Haibo Zhang, Junwei Ding, Wenqing Yao, Weihua Luo, Haiying Zhang, Jinsong Su
Semantic Segmentation of Legal Documents via Rhetorical Roles
Vijit Malik, Rishabh Sanjay, Shouvik Kumar Guha, Angshuman Hazarika, Shubham Nigam, Arnab Bhattacharya, Ashutosh Modi
Creating and Managing a large annotated parallel corpora of Indian languages
Ritesh Kumar, Shiv Bhusan Kaushik, Pinkey Nainwani, Girish Nath Jha