LLM Benchmark
LLM benchmarking aims to objectively evaluate the capabilities of large language models across diverse tasks, addressing limitations of existing approaches such as reliance on static datasets and potential biases in human or LLM-based judging. Current research focuses on developing more robust and dynamic benchmarks, including those built on real-world interactions, game-based competitions, and knowledge-grounded evaluations, often incorporating techniques such as prompt engineering and multi-agent coordination. These efforts are crucial for fostering the responsible development and deployment of LLMs, improving model transparency, and guiding future research directions in AI.
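One way to make the idea of benchmark agreement (the question studied in the BenchBench paper below) concrete is a simple rank-correlation check between two benchmarks' model rankings. The sketch below is illustrative only: the model names and scores are hypothetical, and it uses SciPy's Kendall's tau rather than the BenchBench library itself.

```python
# Minimal sketch of benchmark agreement testing: given two benchmarks'
# scores for the same set of models, measure how well their induced
# rankings agree using Kendall's tau. Model names and scores are
# illustrative placeholders, not results from any real leaderboard.
from scipy.stats import kendalltau

# Hypothetical per-model scores on two different benchmarks.
benchmark_a = {"model-1": 71.2, "model-2": 65.4, "model-3": 80.1, "model-4": 58.9}
benchmark_b = {"model-1": 0.62, "model-2": 0.66, "model-3": 0.79, "model-4": 0.51}

# Align scores by model so both lists follow the same order.
models = sorted(benchmark_a)
scores_a = [benchmark_a[m] for m in models]
scores_b = [benchmark_b[m] for m in models]

# Tau near 1.0 means the benchmarks rank models similarly; values near 0
# suggest they measure different (or noisy) aspects of capability.
tau, p_value = kendalltau(scores_a, scores_b)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```

Low agreement between benchmarks is one signal that a single leaderboard may not generalize, which is the kind of issue dynamic and knowledge-grounded benchmarks try to address.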
Papers
Werewolf Arena: A Case Study in LLM Evaluation via Social Deduction
Suma Bailis, Jane Friedhoff, Feiyang Chen
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen