LLM as a Judge

Large language models (LLMs) are increasingly used as automated evaluators ("LLM-as-a-Judge"), replacing or supplementing human judgment in assessing the quality of other LLMs' outputs. Current research focuses on improving the reliability of these LLM judges and reducing their biases, often employing techniques such as Minimum Bayes Risk decoding and response-adapted references to improve accuracy and alignment with human preferences. The approach offers a cost-effective, scalable alternative to human evaluation, with significant implications for benchmarking, model training (e.g., reinforcement learning from human feedback), and the development of more aligned and robust AI systems.
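As a rough illustration of the basic pattern, the sketch below shows a pairwise LLM-judge comparison with a simple majority vote over several sampled judgments, a lightweight stand-in for the Minimum Bayes Risk-style aggregation mentioned above. The `call_judge` function is a hypothetical placeholder for whatever model API is used, and the prompt wording and verdict parsing are assumptions for illustration, not taken from any specific paper.

```python
from collections import Counter
from typing import Callable

# Assumed prompt template; real systems use more detailed rubrics.
JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user's question and answer with exactly one letter: "A" if Response A is better,
"B" if Response B is better.

Question: {question}

Response A: {answer_a}

Response B: {answer_b}

Verdict:"""


def judge_pairwise(
    question: str,
    answer_a: str,
    answer_b: str,
    call_judge: Callable[[str], str],  # hypothetical: prompt -> raw model text
    n_samples: int = 5,
) -> str:
    """Return "A" or "B" by majority vote over several sampled judgments.

    Sampling the judge multiple times and taking the consensus verdict is a
    simple way to reduce variance in individual judgments.
    """
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    votes = []
    for _ in range(n_samples):
        raw = call_judge(prompt).strip().upper()
        if raw.startswith("A"):
            votes.append("A")
        elif raw.startswith("B"):
            votes.append("B")
        # Anything else is an unparsable verdict and is ignored.
    if not votes:
        raise ValueError("judge returned no parsable verdicts")
    return Counter(votes).most_common(1)[0][0]
```

In practice, the comparison would also be run with the two responses swapped and the verdicts averaged, since position bias is one of the reliability issues such judges are known to exhibit.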

Papers