LLM as a Judge
Large language models (LLMs) are increasingly used as automated evaluators ("LLM-as-a-Judge"), aiming to replace or supplement human judgment when assessing the quality of other LLMs' outputs. Current research focuses on improving reliability and reducing bias in these LLM judges, often employing techniques such as Minimum Bayes Risk decoding and response-adapted references to improve accuracy and alignment with human preferences. This approach offers a cost-effective and scalable alternative to human evaluation, with significant implications for benchmarking, model training (e.g., reinforcement learning from human feedback), and the development of more aligned and robust AI systems.
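To make the basic setup concrete, the sketch below shows a minimal pairwise LLM-as-a-Judge comparison. It is illustrative only: `call_llm`, the prompt wording, and the verdict parsing are placeholder assumptions, not the method of any specific paper listed here; the double-ordering check is one common, simple way to mitigate position bias.

```python
# Minimal sketch of pairwise LLM-as-a-Judge scoring (illustrative, not a
# reference implementation). `call_llm` is a hypothetical placeholder for any
# chat-completion client; swap in your provider's API.

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user question below and decide which one is better overall (helpfulness,
correctness, clarity). Answer with exactly "A", "B", or "TIE".

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}
"""


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("Plug in your LLM client here.")


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'TIE'. Judges both orderings to reduce position bias."""
    verdicts = []
    for a, b, flipped in [(answer_a, answer_b, False), (answer_b, answer_a, True)]:
        raw = call_llm(JUDGE_PROMPT.format(question=question, answer_a=a, answer_b=b))
        verdict = raw.strip().upper()
        if verdict.startswith("A"):
            verdicts.append("B" if flipped else "A")
        elif verdict.startswith("B"):
            verdicts.append("A" if flipped else "B")
        else:
            verdicts.append("TIE")
    # Only accept a winner if both orderings agree; otherwise call it a tie.
    return verdicts[0] if verdicts[0] == verdicts[1] else "TIE"
```

In practice, judges of this kind are further refined with techniques like those above (e.g., Minimum Bayes Risk decoding over multiple judge samples, or response-adapted references) to better align verdicts with human preferences.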
Papers
Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
Ravi Raju, Swayambhoo Jain, Bo Li, Jonathan Li, Urmish Thakker
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar