LLM-Based Evaluation
LLM-based evaluation uses large language models (LLMs) to assess the outputs of other LLMs, automating a traditionally labor-intensive process. Current research aims to improve the reliability and interpretability of these evaluations through techniques such as checklist generation, prompt engineering, and the combination of multiple LLM evaluators, with the goal of achieving higher agreement with human judgments across diverse tasks and languages. Reliable automated evaluation enables more objective comparisons of model capabilities and helps surface biases or weaknesses in existing models, supporting more robust LLM development and deployment.
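To make the checklist idea concrete, below is a minimal sketch of checklist-based LLM-as-judge scoring: one call generates yes/no criteria from the instruction, and further calls grade a candidate response against each criterion. It assumes an OpenAI-compatible chat API; the model name, prompts, and helper functions (generate_checklist, evaluate_response) are illustrative assumptions, not taken from any of the papers listed here.

```python
# Sketch of checklist-based LLM-as-judge evaluation.
# Assumptions: OpenAI Python SDK (>=1.0), OPENAI_API_KEY in the environment,
# and an illustrative judge model name.
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "gpt-4o-mini"  # assumption: any capable instruction-following model


def chat(prompt: str) -> str:
    """Send a single-turn prompt to the judge model and return its reply text."""
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def generate_checklist(instruction: str) -> list[str]:
    """Ask the judge model to decompose an instruction into yes/no criteria."""
    prompt = (
        "Break the following instruction into a short checklist of specific "
        "yes/no questions that a good response must satisfy. "
        "Return one question per line.\n\n"
        f"Instruction: {instruction}"
    )
    return [
        line.strip("- ").strip()
        for line in chat(prompt).splitlines()
        if line.strip()
    ]


def evaluate_response(instruction: str, candidate: str) -> float:
    """Score a candidate response as the fraction of checklist items judged YES."""
    checklist = generate_checklist(instruction)
    passed = 0
    for item in checklist:
        verdict = chat(
            f"Instruction: {instruction}\n\nResponse: {candidate}\n\n"
            f"Question: {item}\nAnswer strictly YES or NO."
        )
        passed += verdict.upper().startswith("YES")
    return passed / len(checklist) if checklist else 0.0


if __name__ == "__main__":
    score = evaluate_response(
        "Summarise the attention mechanism in two sentences.",
        "Attention computes weighted averages of value vectors, with weights "
        "derived from query-key similarity, letting the model focus on the "
        "most relevant tokens.",
    )
    print(f"Checklist pass rate: {score:.2f}")
```

The per-item yes/no verdicts make the score interpretable (you can see which criteria failed), and averaging over items, or over multiple judge models, is one way such evaluators are combined in practice.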
Papers
TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation
Jonathan Cook, Tim Rocktäschel, Jakob Foerster, Dennis Aumiller, Alex Wang
What do Large Language Models Need for Machine Translation Evaluation?
Shenbin Qian, Archchana Sindhujan, Minnie Kabra, Diptesh Kanojia, Constantin Orăsan, Tharindu Ranasinghe, Frédéric Blain
AIME: AI System Optimization via Multiple LLM Evaluators
Bhrij Patel, Souradip Chakraborty, Wesley A. Suttle, Mengdi Wang, Amrit Singh Bedi, Dinesh Manocha