Text Data Augmentation

Text data augmentation aims to improve the performance of natural language processing (NLP) models by artificially expanding their training datasets. Current research relies heavily on large language models (LLMs) such as GPT-4 to generate augmented data through paraphrasing, question-answer pair creation, and related techniques, which helps address problems such as imbalanced datasets and information loss in long texts. The approach is particularly valuable in low-resource settings and for tasks such as relation extraction, automatic scoring, and speech recognition, where sufficient labeled data is hard to obtain. Developing standardized evaluation metrics remains an active area of research, since shared metrics are needed to compare augmentation methods reliably and to track progress in the field.
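As a rough illustration of how augmentation can expand an imbalanced dataset, the sketch below tops up each minority class with augmented copies of its own examples until every class matches the largest one. It is a minimal sketch under stated assumptions, not a method taken from any particular paper: the names `swap_words` and `balance_with_augmentation` are hypothetical, and the augmenter is a simple random word swap used as a stand-in. In an LLM-based pipeline, the `augment` argument would instead be a function that requests a paraphrase from a model.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def swap_words(text: str, rng: random.Random) -> str:
    """Stand-in augmenter (an assumption, not an LLM): swap two word positions."""
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def balance_with_augmentation(
    data: List[Example],
    augment: Callable[[str, random.Random], str] = swap_words,
    seed: int = 0,
) -> List[Example]:
    """Return the original data plus augmented minority-class examples,
    so that every label occurs as often as the most frequent label."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in data)
    target = max(counts.values())
    balanced = list(data)  # original examples are kept unchanged
    for label, n in counts.items():
        pool = [text for text, lab in data if lab == label]
        for _ in range(target - n):
            balanced.append((augment(rng.choice(pool), rng), label))
    return balanced


if __name__ == "__main__":
    train = [
        ("the service was quick and friendly", "pos"),
        ("great value for the price", "pos"),
        ("would happily order again", "pos"),
        ("the package arrived badly damaged", "neg"),
    ]
    for text, label in balance_with_augmentation(train):
        print(label, "|", text)
```

Passing the augmenter as a parameter keeps the balancing logic independent of how new text is produced, so a rule-based transform and a model-backed paraphraser can be compared on the same data.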

Papers