LLM as a Judge
Large language models (LLMs) are increasingly used as automated evaluators ("LLM-as-a-Judge") that replace or supplement human judgment when assessing the quality of other LLMs' outputs. Current research focuses on making these judges more reliable and less biased, using techniques such as Minimum Bayes Risk decoding and response-adapted references to improve accuracy and alignment with human preferences. The approach offers a cost-effective, scalable alternative to human evaluation, with implications for benchmarking, model training (e.g., reinforcement learning from human feedback), and the development of more aligned and robust AI systems.
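As a rough illustration of the Minimum Bayes Risk idea mentioned above, the sketch below picks, from a set of sampled candidate responses, the one with the highest average utility when every other candidate is treated as a pseudo-reference. It is a minimal sketch under stated assumptions, not the method of any paper listed here: the names `mbr_select` and `token_overlap` are hypothetical, and the toy token-overlap score only stands in for the pairwise score that an LLM judge would supply in practice.

```python
from typing import Callable, Sequence


def mbr_select(candidates: Sequence[str],
               utility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest average utility against the others.

    `utility(hyp, ref)` scores a hypothesis against a pseudo-reference.
    In an LLM-as-a-Judge setup it would wrap a judge model (assumption).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        # Every other candidate acts as a pseudo-reference for candidate i.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], ref) for ref in others) / len(others)

    # Ties are resolved in favor of the earliest candidate.
    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


def token_overlap(hyp: str, ref: str) -> float:
    """Toy stand-in for a judge: Jaccard overlap of lowercase tokens."""
    a, b = set(hyp.lower().split()), set(ref.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


candidates = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "I like turtles",
]
best = mbr_select(candidates, token_overlap)
print(best)
```

The two mutually consistent answers reinforce each other, so the outlier is never selected. Swapping `token_overlap` for a function that calls a judge model changes the quality of the utility signal but not the selection logic, at the cost of a number of judge calls that grows quadratically with the number of candidates.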
Papers
Self-rationalization improves LLM as a fine-grained judge
Prapti Trivedi, Aditya Gulati, Oliver Molenschot, Meghana Arakkal Rajeev, Rajkumar Ramamurthy, Keith Stevens, Tanveesh Singh Chaudhery, Jahnavi Jambholkar, James Zou, Nazneen Rajani
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang, Yufei Wang, Tiezheng YU, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
Better Instruction-Following Through Minimum Bayes Risk
Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, Graham Neubig
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang
Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge
Aparna Elangovan, Jongwoo Ko, Lei Xu, Mahsa Elyasi, Ling Liu, Sravan Bodapati, Dan Roth