Code-Switched Data

Code-switched data, i.e., text and speech where multiple languages are interwoven within a single utterance, presents both a significant challenge and an opportunity for natural language processing. Current research focuses on mitigating data scarcity for low-resource languages through techniques such as data augmentation with large language models (e.g., GPT) and fine-tuning or adapting pre-trained multilingual models (e.g., wav2vec 2.0 XLSR) for code-switching. These efforts aim to improve performance on a range of NLP tasks, including speech recognition, machine translation, and information retrieval, ultimately leading to more inclusive and accurate language technologies for multilingual communities.
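A common, simple form of the data augmentation mentioned above is lexical substitution: swapping words in a monolingual sentence for their translations to synthesize code-switched training text (LLM-based approaches generate such mixtures more fluently). The sketch below is a minimal illustration of this idea; the `EN_ES` dictionary and the function name are hypothetical, not from any specific paper.

```python
import random

# Toy English-to-Spanish lexicon for illustration only; real pipelines
# use bilingual dictionaries, word aligners, or LLM-generated translations.
EN_ES = {"house": "casa", "water": "agua", "friend": "amigo"}

def augment_code_switch(tokens, lexicon, p=0.5, seed=0):
    """Replace each token found in the lexicon with its translation
    with probability p, yielding a synthetic code-switched sentence."""
    rng = random.Random(seed)
    return [lexicon[t] if t in lexicon and rng.random() < p else t
            for t in tokens]

sentence = "my friend built a house near the water".split()
print(" ".join(augment_code_switch(sentence, EN_ES, p=1.0)))
# With p=1.0 every dictionary word is swapped:
# my amigo built a casa near the agua
```

Naive substitution ignores syntactic constraints on where switches can occur (e.g., the equivalence constraint), which is one reason LLM-based generation and fine-tuned multilingual models tend to produce more natural code-switched data.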

Papers