LLM as a Judge
Large language models (LLMs) are increasingly used as automated evaluators ("LLM-as-a-Judge") for various tasks, aiming to replace or supplement human judgment in assessing the quality of other LLMs' outputs. Current research focuses on improving the reliability and reducing biases in these LLM judges, often employing techniques like Minimum Bayes Risk decoding and response-adapted references to enhance accuracy and alignment with human preferences. This approach offers a cost-effective and scalable alternative to human evaluation, with significant implications for benchmarking, model training (e.g., reinforcement learning from human feedback), and the development of more aligned and robust AI systems.
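To make the Minimum Bayes Risk (MBR) idea mentioned above concrete, here is a minimal sketch of MBR selection over a set of sampled responses, where the other candidates serve as pseudo-references. The function names (mbr_select, token_overlap) and the toy overlap utility are illustrative assumptions, not code from the papers listed below; in practice the utility would typically be an LLM judge scoring a hypothesis against a reference.

```python
# Minimal sketch of Minimum Bayes Risk (MBR) selection.
# Assumptions: `candidates` are sampled model responses, and `utility(hyp, ref)`
# returns a quality score for `hyp` given `ref` (here a toy token-overlap stand-in
# for an LLM-judge score).

from typing import Callable, List


def mbr_select(candidates: List[str], utility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest average utility against the
    remaining candidates, which act as pseudo-references."""
    best_idx, best_score = 0, float("-inf")
    for i, hyp in enumerate(candidates):
        # Expected utility of `hyp`, estimated over the other candidates.
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best_idx, best_score = i, score
    return candidates[best_idx]


def token_overlap(hyp: str, ref: str) -> float:
    """Toy utility: Jaccard overlap of token sets (placeholder for a judge model)."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)


if __name__ == "__main__":
    candidates = [
        "The cat sat on the mat.",
        "A cat is sitting on the mat.",
        "Dogs bark loudly at night.",
    ]
    # Picks the response most supported by the other samples.
    print(mbr_select(candidates, token_overlap))
```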
24 papers
October 3, 2024
Better Instruction-Following Through Minimum Bayes Risk
Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, Graham Neubig
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, et al.
Beyond Correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge
Aparna Elangovan, Lei Xu, Jongwoo Ko, Mahsa Elyasi, Ling Liu, Sravan Bodapati, Dan Roth
August 16, 2024
Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
Ravi Raju, Swayambhoo Jain, Bo Li, Jonathan Li, Urmish Thakker
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar