Multilingual Evaluation
Multilingual evaluation of large language models (LLMs) assesses their performance across diverse languages, going beyond the dominant English-centric benchmarks. Current research focuses on building more comprehensive and representative multilingual datasets, evaluating both open-source and proprietary models on diverse tasks (e.g., question answering, translation, topic classification, sentiment analysis), and analyzing performance disparities across languages with varying resource levels. Such rigorous evaluation is crucial for identifying biases, improving model robustness, and ensuring equitable access to advanced language technologies across linguistic communities.
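As a concrete illustration of the per-language breakdown these evaluations rely on, the minimal sketch below groups model predictions by language, reports accuracy for each, and macro-averages over a low-resource subset. The toy records, language codes, and the LOW_RESOURCE grouping are illustrative assumptions, not taken from any of the papers listed here.

```python
from collections import defaultdict

# Illustrative only: toy predictions tagged with ISO 639-1 language codes.
# In practice these would come from running a model on a multilingual
# benchmark (e.g., a topic-classification or QA test set).
EXAMPLES = [
    {"lang": "en", "gold": "science",  "pred": "science"},
    {"lang": "en", "gold": "sports",   "pred": "sports"},
    {"lang": "sw", "gold": "politics", "pred": "science"},
    {"lang": "sw", "gold": "health",   "pred": "health"},
    {"lang": "yo", "gold": "sports",   "pred": "politics"},
]

# Hypothetical grouping used to contrast resource levels.
LOW_RESOURCE = {"sw", "yo"}


def per_language_accuracy(examples):
    """Return {language: accuracy} computed over the given examples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["lang"]] += 1
        correct[ex["lang"]] += int(ex["gold"] == ex["pred"])
    return {lang: correct[lang] / total[lang] for lang in total}


def subset_macro_accuracy(scores, langs):
    """Macro-average accuracy over a subset of languages (e.g., low-resource)."""
    selected = [scores[lang] for lang in langs if lang in scores]
    return sum(selected) / len(selected) if selected else float("nan")


if __name__ == "__main__":
    scores = per_language_accuracy(EXAMPLES)
    for lang, acc in sorted(scores.items()):
        print(f"{lang}: {acc:.2f}")
    print(f"low-resource macro-avg: {subset_macro_accuracy(scores, LOW_RESOURCE):.2f}")
```

Reporting per-language scores rather than a single pooled number is what makes disparities between high- and low-resource languages visible in the first place.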
Papers
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?
Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, Annie En-Shiun Lee