Small Corpus

Research on small corpora focuses on methods for effectively training and applying language models when data is limited, a constraint that arises across many NLP tasks. Current efforts adapt existing architectures such as Transformers and BERT, employing transfer learning, data augmentation (including hallucinated data), and novel annotation schemes to maximize performance under data scarcity; a small augmentation sketch follows below. This work is crucial for advancing NLP in low-resource languages and domains where large datasets are unavailable, enabling applications in areas such as healthcare, legal tech, and accessibility.
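To make the data-augmentation idea concrete, the sketch below expands a tiny labeled corpus with EDA-style surface perturbations (random word deletion and random word swaps) before any training step. This is a minimal illustration of the general technique, not the method of any particular paper; the `augment` helper, its parameter names, and its defaults are hypothetical.

```python
import random

def augment(sentence, n_aug=4, p_delete=0.1, seed=None):
    """EDA-style augmentation for small corpora: generate noisy copies of a
    sentence via random word deletion and random word swaps.
    (Hypothetical helper; names and default values are illustrative.)"""
    rng = random.Random(seed)
    words = sentence.split()
    augmented = []
    for _ in range(n_aug):
        # Random deletion: drop each word independently with probability p_delete,
        # keeping at least one word so the example never becomes empty.
        kept = [w for w in words if rng.random() > p_delete] or words[:1]
        # Random swap: exchange one randomly chosen pair of positions.
        if len(kept) > 1:
            i, j = rng.sample(range(len(kept)), 2)
            kept[i], kept[j] = kept[j], kept[i]
        augmented.append(" ".join(kept))
    return augmented

# Example: expand a tiny labeled corpus, preserving each original example.
corpus = [("the service was excellent", "pos"),
          ("the food arrived cold", "neg")]
expanded = [(text_variant, label)
            for text, label in corpus
            for text_variant in [text] + augment(text, seed=0)]
print(len(expanded))  # 10 examples derived from 2 originals
```

In practice, such surface-level perturbations are typically combined with transfer learning: the expanded corpus is used to fine-tune a pretrained model such as BERT rather than to train a model from scratch.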

Papers