Corpus Creation

Corpus creation focuses on building large, high-quality datasets of text, speech, or both for training and evaluating natural language processing (NLP) models. Current research emphasizes corpora tailored to specific tasks, such as scientific mention detection, adverse drug event identification, and the analysis of argumentative structures. This work often incorporates multimodal data (text and images) and leverages deep learning architectures such as transformers (e.g., BERT) and large language models (LLMs). These corpora are crucial for advancing NLP research, particularly for low-resource languages, and for improving applications ranging from information retrieval and machine translation to healthcare and education.

Papers