Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on training more efficiently and effectively on diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific biases. This work advances natural language processing by enabling more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
Curatr: A Platform for Semantic Analysis and Curation of Historical Literary Texts
Susan Leavy, Gerardine Meaney, Karen Wade, Derek Greene
Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech
Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen
Prompt-based Extraction of Social Determinants of Health Using Few-shot Learning
Giridhar Kaushik Ramachandran, Yujuan Fu, Bin Han, Kevin Lybarger, Nicholas J Dobbins, Özlem Uzuner, Meliha Yetisgen
Exploring Attention Mechanisms for Multimodal Emotion Recognition in an Emergency Call Center Corpus
Théo Deschamps-Berger, Lori Lamel, Laurence Devillers
Gradient Ascent Post-training Enhances Language Model Generalization
Dongkeun Yoon, Joel Jang, Sungdong Kim, Minjoon Seo
Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT and GPT-4 for Mining Insights at Scale
Jonas Oppenlaender, Joonas Hämäläinen
A modified model for topic detection from a corpus and a new metric evaluating the understandability of topics
Tomoya Kitano, Yuto Miyatake, Daisuke Furihata
Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers
Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang
When to Read Documents or QA History: On Unified and Selective Open-domain QA
Kyungjae Lee, Sang-eun Han, Seung-won Hwang, Moontae Lee
RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction
Takahiro Fukumori, Taito Ishida, Yoichi Yamashita
Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer
Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
From 'Snippet-lects' to Doculects and Dialects: Leveraging Neural Representations of Speech for Placing Audio Signals in a Language Landscape
Séverine Guillaume, Guillaume Wisniewski, Alexis Michaud
Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy
Pengfei Yu, Heng Ji