Multilingual Dataset

Multilingual datasets are collections of text and/or speech data spanning multiple languages, aiming to improve the performance and cross-lingual capabilities of language models. Current research focuses on creating high-quality, diverse datasets for various tasks, including machine translation, sentiment analysis, and speech emotion recognition, often employing techniques like parameter-efficient transfer learning and leveraging pre-trained models such as BERT and Whisper. These datasets are crucial for developing more robust and inclusive language technologies, addressing the limitations of English-centric models and enabling applications in diverse linguistic and cultural contexts.

Papers