Multilingual Dataset
Multilingual datasets are collections of text and/or speech data spanning multiple languages, built to improve the performance and cross-lingual capabilities of language models. Current research focuses on creating high-quality, diverse datasets for tasks such as machine translation, sentiment analysis, and speech emotion recognition, often using parameter-efficient transfer learning and pre-trained models such as BERT and Whisper. These datasets are crucial for developing more robust and inclusive language technologies: they address the limitations of English-centric models and enable applications in diverse linguistic and cultural contexts.
Papers
YT-30M: A multi-lingual multi-category dataset of YouTube comments
Hridoy Sankar Dutta
LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings
Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, Sara Hooker
Comparative Study of Multilingual Idioms and Similes in Large Language Models
Paria Khoshtab, Danial Namazifard, Mostafa Masoudi, Ali Akhgary, Samin Mahdizadeh Sani, Yadollah Yaghoobzadeh
Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model
Divyanshu Aggarwal, Sankarshan Damle, Navin Goyal, Satya Lokam, Sunayana Sitaram