Multilingual Dataset
Multilingual datasets are collections of text and/or speech data spanning multiple languages, built to improve the performance and cross-lingual capabilities of language models. Current research focuses on creating high-quality, diverse datasets for tasks such as machine translation, sentiment analysis, and speech emotion recognition, often using parameter-efficient transfer learning and pre-trained models such as BERT and Whisper. These datasets are crucial for developing more robust and inclusive language technologies: they address the limitations of English-centric models and enable applications in diverse linguistic and cultural contexts.
Papers
FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models
Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen
LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR
Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen