Low-Resource Language
Low-resource language (LRL) research focuses on developing natural language processing (NLP) techniques for languages lacking substantial digital resources, aiming to bridge the technological gap between high- and low-resource languages. Current research emphasizes leveraging multilingual pre-trained models like Whisper and adapting them to LRLs through techniques such as weighted cross-entropy, data augmentation (including synthetic data generation), and model optimization methods like pruning and knowledge distillation. This work is crucial for promoting linguistic diversity, enabling access to technology for under-resourced communities, and advancing the broader field of NLP by addressing the challenges posed by data scarcity and linguistic variation.
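To make the weighted cross-entropy idea concrete, here is a minimal sketch in plain Python: tokens from under-represented classes (e.g. a low-resource language in a multilingual batch) receive a larger per-class weight, so the model is penalized more for errors on them. The function name and the choice of weights are illustrative, not taken from any specific paper above.

```python
import math

def weighted_cross_entropy(logits, target, weights):
    """Weighted cross-entropy for a single prediction.

    logits  -- unnormalized scores, one per class
    target  -- index of the correct class
    weights -- per-class weights; a higher weight upweights the loss
               for that class (e.g. a low-resource language)
    """
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    log_prob = math.log(exps[target] / z)
    # Scale the negative log-likelihood by the target class's weight.
    return -weights[target] * log_prob

# Illustrative usage: doubling a class's weight doubles its loss,
# shifting the training signal toward that class.
uniform = weighted_cross_entropy([1.0, 2.0, 0.5], 1, [1.0, 1.0, 1.0])
upweighted = weighted_cross_entropy([1.0, 2.0, 0.5], 1, [1.0, 2.0, 1.0])
```

In practice the weights are often set inversely proportional to each language's share of the training data, so scarce languages contribute a training signal comparable to abundant ones.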
Papers
Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers
Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez
Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?
Zeno Vandenbulcke, Lukas Vermeire, Miryam de Lhoneux
Cultural Fidelity in Large-Language Models: An Evaluation of Online Language Resources as a Driver of Model Performance in Value Representation
Sharif Kazemi, Gloria Gerhardt, Jonty Katz, Caroline Ida Kuria, Estelle Pan, Umang Prabhakar
Ukrainian-to-English folktale corpus: Parallel corpus creation and augmentation for machine translation in low-resource languages
Olena Burda-Lassen
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios
Aditya Joshi, Diptesh Kanojia, Heather Lent, Hour Kaing, Haiyue Song