Large Corpus
Large corpora are massive collections of text and other data used to train language models and other AI systems. Current research focuses on training more efficiently and effectively on diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific biases. The aim is more robust, accurate, and versatile systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
Building an Icelandic Entity Linking Corpus
Steinunn Rut Friðriksdóttir, Valdimar Ágúst Eggertsson, Benedikt Geir Jóhannesson, Hjalti Daníelsson, Hrafn Loftsson, Hafsteinn Einarsson
Borrowing or Codeswitching? Annotating for Finer-Grained Distinctions in Language Mixing
Elena Alvarez Mellado, Constantine Lignos
RuCoCo: a new Russian corpus with coreference annotation
Vladimir Dobrovolskii, Mariia Michurina, Alexandra Ivoylova
Unsupervised Key Event Detection from Massive Text Corpora
Yunyi Zhang, Fang Guo, Jiaming Shen, Jiawei Han
The Open corpus of the Veps and Karelian languages: overview and applications
Tatyana Boyko, Nina Zaitseva, Natalia Krizhanovskaya, Andrew Krizhanovsky, Irina Novak, Nataliya Pellinen, Aleksandra Rodionova
LEXpander: applying colexification networks to automated lexicon expansion
Anna Di Natale, David Garcia
An Informational Space Based Semantic Analysis for Scientific Texts
Neslihan Suzen, Alexander N. Gorban, Jeremy Levesley, Evgeny M. Mirkes
NEWTS: A Corpus for News Topic-Focused Summarization
Seyed Ali Bahrainian, Sheridan Feucht, Carsten Eickhoff
APPReddit: a Corpus of Reddit Posts Annotated for Appraisal
Marco Antonio Stranisci, Simona Frenda, Eleonora Ceccaldi, Valerio Basile, Rossana Damiano, Viviana Patti