Specific Corpus

Specific corpora are collections of text data tailored to particular domains, and they are central to advancing natural language processing (NLP). Current research focuses on building and using these corpora for applications that range from analyzing historical language contact and detecting online conspiracy theories to improving the performance of large language models (LLMs) in specialized fields such as medicine and education. Researchers apply a variety of techniques to make models trained on these datasets more useful and accurate, including LLMs for data cleaning and prompt optimization, support vector machines for classification, and knowledge distillation for model compression. Developing and analyzing such corpora helps make NLP more reliable and applicable across scientific disciplines and practical settings.

Papers

October 21, 2024

RAG4ITOps: A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance
Tianyang Zhang, Zhuoxuan Jiang, Shengguang Bai, Tianrui Zhang, Lin Lin, Yang Liu, Jiawei Ren
Retrieval Augmented Generation Open Domain Question Answering Maintenance Required RAG Based Domain Specific Question Answering Specific Corpus

August 8, 2024

Molyé: A Corpus-based Approach to Language Contact in Colonial France
Rasul Dent, Juliette Janès, Thibault Clérice, Pedro Ortiz Suarez, Benoît Sagot
Corpus Based Natural Language Communication Specific Corpus Creole Language

July 4, 2024

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera, Rubén Manrique
Large Corpus Optical Character Recognition Latin Text Linguistic Data Specific Corpus History Representation Spanish Corpus

May 8, 2024

CourseGPT-zh: an Educational Large Language Model Based on Knowledge Distillation Incorporating Prompt Optimization
Zheyan Qu, Lu Yin, Zitong Yu, Wenbo Wang, Xing Zhang
Large Language Model Language Model Natural Language Processing Knowledge Distillation Question Answering Prompt Optimization Open Source LLM Specific Corpus

April 27, 2024

Detection of Conspiracy Theories Beyond Keyword Bias in German-Language Telegram Using Large Language Models
Milena Pustet, Elisabeth Steffen, Helena Mihaljević
Data Detection Supervised Fine Tuning Temporal Shift Token Bias Conspiracy Theory Specific Corpus Telegram Post

March 29, 2024

Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus
Silvia García-Méndez, Milagros Fernández-Gavilanes, Jonathan Juncal-Martínez, Francisco J. González-Castaño, Oscar Barba Seara
Natural Language Processing Short Text Specific Corpus Text Representation Method

February 10, 2024

DAEDRA: A language model for predicting outcomes in passive pharmacovigilance reporting
Chris von Csefalvay
Language Model Large Corpus Multiple Outcome Domain Specific Language Model Domain Specific Model General Language Model Pharmacovigilance Event Extraction Specific Corpus

November 14, 2023

MC²: Towards Transparent and Culturally-Aware NLP for Minority Languages in China
Chen Zhang, Mingxu Tao, Quzhe Huang, Jiuheng Lin, Zhibin Chen, Yansong Feng
Low Resource Language Multilingual Corpus Rural China Specific Corpus

December 15, 2022

The Effects of In-domain Corpus Size on pre-training BERT
Chris Sanchez, Zheyuan Zhang
Mixed Effect Domain Specific Bidirectional Encoder Representation Biomedical Corpus Specific Corpus

October 10, 2022

Knowledge Distillation Transfer Sets and their Impact on Downstream NLU Tasks
Charith Peris, Lizhen Tan, Thomas Gueudre, Turan Gojayev, Pan Wei, Gokmen Oz
Knowledge Distillation Global Impact Task Specific Downstream NLP Task Specific Corpus General Corpus

September 20, 2022

Register Variation Remains Stable Across 60 Languages
Haipeng Li, Jonathan Dunn, Andrea Nini
Unknown Language Linguistic Feature Stable Code Specific Corpus Communicative Context

July 26, 2022

Learning structures of the French clinical language: development and validation of word embedding models using 21 million clinical reports from electronic health records
Basile Dura, Charline Jean, Xavier Tannier, Alice Calliger, Romain Bey, Antoine Neuraz, Rémi Flicoteaux
Language Model NLP Task Electronic Health Record Real Text Word Specific Corpus Learning Structure French Clinical

May 24, 2022

Overview of STEM Science as Process, Method, Material, and Data Named Entities
Jennifer D'Souza
Knowledge Graph Practical Method Entity Mention Complex Process Material Response STEM Education Specific Corpus Textual Database
