Large Corpus
Large corpora, massive collections of text and other data, are fundamental to training advanced language models and other AI systems. Current research focuses on training more efficiently and effectively on heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate the "curse of multilinguality" (the drop in per-language performance as a single model covers more languages) and domain-specific biases. This work advances natural language processing by enabling more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
SCOUT: A Situated and Multi-Modal Human-Robot Dialogue Corpus
Stephanie M. Lukin, Claire Bonial, Matthew Marge, Taylor Hudson, Cory J. Hayes, Kimberly A. Pollard, Anthony Baker, Ashley N. Foots, Ron Artstein, Felix Gervits, Mitchell Abrams, Cassidy Henry, Lucia Donatelli, Anton Leuski, Susan G. Hill, David Traum, Clare R. Voss
Probing the Capacity of Language Model Agents to Operationalize Disparate Experiential Context Despite Distraction
Sonny George, Chris Sypherd, Dylan Cashman
Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus
Terufumi Morishita, Gaku Morio, Atsuki Yamaguchi, Yasuhiro Sogawa
ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity
Tong Xie, Hanzhi Zhang, Shaozhou Wang, Yuwei Wan, Imran Razzak, Chunyu Kit, Wenjie Zhang, Bram Hoex
Large corpora and large language models: a replicable method for automating grammatical annotation
Cameron Morin, Matti Marttinen Larsson
Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
Benjamin Litterer, David Jurgens, Dallas Card
Ethical Concern Identification in NLP: A Corpus of ACL Anthology Ethics Statements
Antonia Karamolegkou, Sandrine Schiller Hansen, Ariadni Christopoulou, Filippos Stamatiou, Anne Lauscher, Anders Søgaard
Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language
Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, Yihong Chen, Raphael Tang, Pontus Stenetorp
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran, François Yvon, Hinrich Schütze