Low Resource Language

Low-resource language (LRL) research focuses on developing natural language processing (NLP) techniques for languages lacking substantial digital resources, aiming to bridge the technological gap between high- and low-resource languages. Current research emphasizes leveraging multilingual pre-trained models like Whisper and adapting them to LRLs through techniques such as weighted cross-entropy, data augmentation (including synthetic data generation), and model optimization methods like pruning and knowledge distillation. This work is crucial for promoting linguistic diversity, enabling access to technology for under-resourced communities, and advancing the broader field of NLP by addressing the challenges posed by data scarcity and linguistic variation.

Papers

February 12, 2024

Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, Charitha Rathnayake
Large Corpus Neural Machine Translation Low Resource Language Quality Issue Task Utility Parallel Corpus Language Pair Artefact PARTICLE Deep Dive Web Mined Corpus

February 8, 2024

February 6, 2024

AlbNews: A Corpus of Headlines for Topic Modeling in Albanian
Erion Çano, Dario Lamaj
Large Corpus Low Resource Language Natural Language Processing Task Topic Modeling Informative Sample

February 4, 2024

Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity
Eric Khiu, Hasti Toossi, David Anugraha, Jinyu Liu, Jiaxu Li, Juan Armando Parra Flores, Leandro Acros Roman, A. Seza Doğruöz, En-Shiun Annie Lee
Integral Role Low Resource Language Multilingual Large Language Model High Resource Language Machine Translation Model Language Similarity Machine Translation Performance Domain Similarity Corpus Training

February 3, 2024

January 29, 2024

Massively Multilingual Text Translation For Low-Resource Languages
Zhong Zhou
Machine Translation Low Resource Language Multilingual Translation

January 24, 2024

January 12, 2024

XLS-R Deep Learning Model for Multilingual ASR on Low- Resource Languages: Indonesian, Javanese, and Sundanese
Panji Arisaputra, Alif Tri Handoyo, Amalia Zahra
Automatic Speech Recognition Low Resource Language Automatic Speech Recognition System Multilingual Automatic Speech Recognition Automatic Speech Recognition Performance Indonesian Language Speech Recognition Accuracy

January 11, 2024

January 7, 2024

Building Efficient and Effective OpenQA Systems for Low-Resource Languages
Emrah Budur, Rıza Özçelik, Dilara Soylu, Omar Khattab, Tunga Güngör, Christopher Potts
Low Resource Language OpenQA System Natural Language Answer

January 4, 2024

LLM Augmented LLMs: Expanding Capabilities through Composition
Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Shikhar Vashishth, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, Partha Talukdar
Language Model Code Generation Low Resource Language Capability Evolution Compositional Ability Foundational Model Code Model LLM Based Augmentation

January 2, 2024

LLaMA Beyond English: An Empirical Study on Language Capability Transfer
Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, Xuanjing Huang
Large Language Model Empirical Study Language Generation Low Resource Language LLaMa LlamaCare Knowledge Alignment Vocabulary Expansion Lingual Transfer Capability

December 26, 2023

Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages
Mofetoluwa Adeyemi, Akintunde Oladipo, Ronak Pradeep, Jimmy Lin
Large Language Model Low Resource Language Document Re Ranking Cross Lingual Information Retrieval

December 21, 2023

Typhoon: Thai Large Language Models
Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol, Ruangsak Patomwong, Pathomporn Chokchainant, Kasima Tharnpipitchai
Large Language Model Low Resource Language Instruction Dataset Tropical Cyclone Thai Language