Evaluation Suite
Evaluation suites are comprehensive benchmarks designed to rigorously assess the capabilities and limitations of large language models (LLMs), including multimodal models. Current research focuses on suites that evaluate diverse aspects, such as reasoning, safety (including prompt injection and code interpreter abuse), cross-lingual performance, and robustness to variations in question phrasing or to image manipulation. These suites are crucial for identifying strengths and weaknesses in LLMs, guiding improved model development, and supporting responsible deployment across applications and languages.
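As a concrete illustration of the robustness dimension mentioned above, the following minimal sketch scores a model on whether it answers every paraphrase of a question correctly. The `EvalCase`, `exact_match`, and `robustness_score` names and the toy model are illustrative assumptions, not APIs from any of the papers in this topic; a real harness would call an actual LLM endpoint and use task-appropriate metrics.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One benchmark item: paraphrased phrasings of the same question plus a reference answer."""
    paraphrases: List[str]
    reference: str


def exact_match(prediction: str, reference: str) -> bool:
    """Simple exact-match metric after normalising whitespace and case."""
    return prediction.strip().lower() == reference.strip().lower()


def robustness_score(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases the model answers correctly under *every* paraphrase.

    A case counts as passed only if all phrasings yield the reference answer,
    which penalises sensitivity to surface-level wording changes.
    """
    passed = 0
    for case in cases:
        if all(exact_match(model(q), case.reference) for q in case.paraphrases):
            passed += 1
    return passed / len(cases) if cases else 0.0


if __name__ == "__main__":
    # Toy stand-in for an LLM call (hypothetical; replace with a real model client).
    def toy_model(question: str) -> str:
        q = question.lower()
        return "paris" if "capital" in q and "france" in q else "unknown"

    suite = [
        EvalCase(
            paraphrases=[
                "What is the capital of France?",
                "Name the capital city of France.",
            ],
            reference="Paris",
        ),
    ]
    print(f"Robustness score: {robustness_score(toy_model, suite):.2f}")
```

Requiring every paraphrase to pass, rather than averaging per-paraphrase accuracy, is one common design choice for exposing sensitivity to wording changes; an averaged variant is equally easy to compute from the same cases.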