Large Corpus

Large corpora are massive collections of text and other data that are fundamental to training advanced language models and other AI systems. Current research focuses on training more efficiently and effectively on diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific biases. This work advances natural language processing by enabling more robust, accurate, and versatile AI systems across languages and domains, with applications ranging from question answering to knowledge graph construction.

516 papers

Papers - Page 11

March 25, 2024

Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks
Keyaki Ohno, Hirotaka Kameko, Keisuke Shirai, Taichi Nishimura, Shinsuke Mori
Automatic Construction Geo Entity Large Corpus Noisy Hyperlink

March 24, 2024

Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling
Yida Mu, Chun Dong, Kalina Bontcheva, Xingyi Song
Topic Modeling Significant Topic Large Language Model Current Method Topic Detection Large Corpus

March 23, 2024

RAAMove: A Corpus for Analyzing Moves in Research Article Abstracts
Hongzheng Li, Ruojin Wang, Ge Shi, Xing Lv, Lei Lei, Chong Feng, Fang Liu, Jinkun Lin, Yangguang Mei, Lingnan Xu
Multi Domain Corpus Discourse Structure Large Corpus Scientific Abstract Effective Non Local Move Annotated Corpus Corpus Creation

March 20, 2024

March 15, 2024

March 14, 2024

Geographically-Informed Language Identification
Jonathan Dunn, Lane Edwards-Brown
Language Label Large Corpus Language Identification

March 11, 2024

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, and 2 more authors
Reference Summary Large Corpus Expert Driven Monitoring Peer Review Corpus Level Global Impact Large Language Model ChatGPT Generated Conversation

March 8, 2024

FFSTC: Fongbe to French Speech Translation Corpus
D. Fortune Kponou, Frejus A. A. Laleye, Eugene C. Ezin
Speech Translation Corpus Large Corpus Music Transcription

March 6, 2024

GPTopic: Dynamic and Interactive Topic Representations
Arik Reuter, Anton Thielmann, Christoph Weisser, Sebastian Fischer, Benjamin Säfken
Large Corpus Topic Classification Topic Modeling

March 4, 2024

March 2, 2024

VBART: The Turkish LLM
Meliksah Turker, Mehmet Erdi Ari, Aydin Han
Multilingual Model Large Corpus Turkish Text Language Model Turkish Natural Language

March 1, 2024

Papers - Page 11

Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling

RAAMove: A Corpus for Analyzing Moves in Research Article Abstracts

A New Massive Multilingual Dataset for High-Performance Language Technologies

How Gender Interacts with Political Values: A Case Study on Czech BERT Models

Can Factual Statements be Deceptive? The DeFaBel Corpus of Belief-based Deception

RAFT: Adapting Language Model to Domain Specific RAG

Geographically-Informed Language Identification

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models

Validating and Exploring Large Geographic Corpora

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

FFSTC: Fongbe to French Speech Translation Corpus

GPTopic: Dynamic and Interactive Topic Representations

Detection of Non-recorded Word Senses in English and Swedish

LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset

VBART: The Turkish LLM

PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus

EUROPA: A Legal Multilingual Keyphrase Generation Dataset

CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions