LLM Evaluation
Evaluation of large language models (LLMs) aims to establish their reliability, safety, and suitability for specific applications. Current research emphasizes robust, comprehensive evaluation frameworks that move beyond simple accuracy metrics to assess aspects such as data privacy, bias, explainability, and the ability to combine different skills. Rigorous evaluation of this kind is crucial for responsible LLM development and deployment: it informs both the scientific understanding of these models and their safe integration into real-world applications across diverse fields.
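To make the "beyond accuracy" point concrete, below is a minimal sketch of an evaluation harness that reports exact-match accuracy alongside a robustness check under an answer-preserving prompt perturbation (loosely in the spirit of the knowledge-invariant perturbations studied in PertEval, listed below, but not that paper's method). Everything here is illustrative: `query_model`, `perturb`, and the toy data are hypothetical stand-ins, not any particular benchmark's API.

```python
# Minimal sketch: accuracy plus robustness under an answer-preserving
# prompt perturbation. `query_model` is a hypothetical stand-in for any
# LLM call; the perturbation and toy data are illustrative only.

from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)

def exact_match(prediction: str, expected: str) -> bool:
    return prediction.strip().lower() == expected.strip().lower()

def perturb(prompt: str) -> str:
    # A trivial surface-level rewording that should not change the answer.
    return "Please answer the following question. " + prompt

def evaluate(query_model: Callable[[str], str], data: List[Example]) -> dict:
    correct = robust = 0
    for prompt, expected in data:
        if exact_match(query_model(prompt), expected):
            correct += 1
            # Count an item as robust only if the perturbed prompt also succeeds.
            if exact_match(query_model(perturb(prompt)), expected):
                robust += 1
    n = len(data)
    return {
        "accuracy": correct / n,
        "robustness_under_perturbation": robust / max(correct, 1),
    }

if __name__ == "__main__":
    # Toy model and data to show the harness end to end.
    toy_data = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
    toy_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
    print(evaluate(toy_model, toy_data))
```

A harness in this shape makes the gap between raw accuracy and perturbation robustness visible as two separate numbers, which is the kind of multi-dimensional reporting the papers below argue for.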
Papers
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Tim Franzmeyer, Aleksandar Shtedritski, Samuel Albanie, Philip Torr, João F. Henriques, Jakob N. Foerster
The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
Bhashithe Abeysinghe, Ruhan Circi
Large Language Models as Evaluators for Recommendation Explanations
Xiaoyu Zhang, Yishan Li, Jiayin Wang, Bowen Sun, Weizhi Ma, Peijie Sun, Min Zhang
Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions
Ruochen Zhao, Wenxuan Zhang, Yew Ken Chia, Weiwen Xu, Deli Zhao, Lidong Bing
PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
Jiatong Li, Renjun Hu, Kunzhe Huang, Yan Zhuang, Qi Liu, Mengxiao Zhu, Xing Shi, Wei Lin