Multilingual Evaluation
Multilingual evaluation of large language models (LLMs) aims to assess their performance across diverse languages, moving beyond dominant English-centric benchmarks. Current research focuses on building more comprehensive and representative multilingual datasets, evaluating a range of models (both open-source and proprietary) on diverse tasks (e.g., question answering, translation, sentiment analysis), and analyzing performance disparities between languages with different resource levels. Such evaluation is crucial for identifying biases, improving model robustness, and ensuring equitable access to advanced language technologies across linguistic communities.
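The sketch below illustrates, in minimal form, the kind of per-language scoring and resource-tier comparison described above. It is not taken from any of the papers listed here; `dummy_model`, the toy examples, and the resource-tier labels are hypothetical placeholders for a real LLM call and a real multilingual benchmark.

```python
# Minimal sketch of per-language evaluation with a high- vs. low-resource comparison.
# Everything here (dummy_model, the toy examples, the tier labels) is illustrative.
from collections import defaultdict

def dummy_model(question: str) -> str:
    """Stand-in for an LLM call. It only 'understands' questions whose characters
    fall in low Unicode ranges, mimicking weaker performance on non-Latin scripts."""
    return "Paris" if all(ord(c) < 0x1000 for c in question) else "?"

# Toy language-tagged QA examples; a real benchmark would have many per language.
examples = [
    {"lang": "en", "tier": "high", "question": "What is the capital of France?", "answer": "Paris"},
    {"lang": "fr", "tier": "high", "question": "Quelle est la capitale de la France ?", "answer": "Paris"},
    {"lang": "sw", "tier": "low",  "question": "Mji mkuu wa Ufaransa ni upi?", "answer": "Paris"},
    {"lang": "am", "tier": "low",  "question": "የፈረንሳይ ዋና ከተማ ምንድን ነው?", "answer": "Paris"},
]

def evaluate(model, examples):
    """Return exact-match accuracy per language."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["lang"]] += 1
        if model(ex["question"]).strip().lower() == ex["answer"].strip().lower():
            correct[ex["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

if __name__ == "__main__":
    per_lang = evaluate(dummy_model, examples)
    print("Per-language accuracy:", per_lang)

    # Aggregate by resource tier to surface high- vs. low-resource disparities.
    tiers = defaultdict(list)
    for ex in examples:
        tiers[ex["tier"]].append(per_lang[ex["lang"]])
    for tier, scores in tiers.items():
        print(f"{tier}-resource mean accuracy: {sum(scores) / len(scores):.2f}")
```

Real multilingual evaluations replace exact match with task-appropriate metrics (e.g., chrF for translation or LLM-as-judge scores for open-ended generation), but the aggregation pattern, scoring each language separately and then comparing across resource tiers, stays the same.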
Papers
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, Amin Ahmad
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Varun Gumma, Anandhita Raghunath, Mohit Jain, Sunayana Sitaram
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra