LLM Evaluator
LLM evaluators are large language models (LLMs) designed to assess the quality of text generated by other LLMs, addressing the high cost and subjectivity of human evaluation. Current research focuses on improving the accuracy and reliability of these evaluators by mitigating biases (e.g., position bias, token count bias, self-preference), enhancing alignment with human judgments, and exploring diverse architectures such as ensembles of smaller models or hierarchical decomposition of evaluation criteria. This field is crucial for advancing LLM development, enabling more objective benchmarking and facilitating the responsible deployment of LLMs across various applications.
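To make the bias-mitigation idea concrete, below is a minimal sketch of pairwise LLM-as-a-judge evaluation that controls for position bias by querying the evaluator in both presentation orders and keeping only consistent verdicts. The `call_llm` callable and the `JUDGE_PROMPT` template are illustrative assumptions, not any particular provider's API; substitute your own model client.

```python
# Minimal sketch: pairwise LLM-as-a-judge with position-bias mitigation.
# `call_llm` is a hypothetical stand-in for a chat-completion call.
from typing import Callable

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
instruction and answer with a single letter: "A" if Response A is better,
"B" if Response B is better, or "TIE" if they are of equal quality.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Verdict:"""


def judge_pair(
    call_llm: Callable[[str], str],
    instruction: str,
    response_1: str,
    response_2: str,
) -> str:
    """Judge response_1 vs. response_2 in both orders to control position bias."""
    # First pass: response_1 presented as "A", response_2 as "B".
    verdict_fwd = call_llm(JUDGE_PROMPT.format(
        instruction=instruction, response_a=response_1, response_b=response_2
    )).strip().upper()

    # Second pass: swap the presentation order.
    verdict_rev = call_llm(JUDGE_PROMPT.format(
        instruction=instruction, response_a=response_2, response_b=response_1
    )).strip().upper()

    # Map the swapped verdict back to the original labeling.
    verdict_rev = {"A": "B", "B": "A", "TIE": "TIE"}.get(verdict_rev, "TIE")

    # Disagreement across orders signals position bias; treat it as a tie.
    return verdict_fwd if verdict_fwd == verdict_rev else "TIE"
```

The same pattern extends to ensembles of smaller judges: run several evaluators over both orders and aggregate their verdicts by majority vote rather than trusting a single model's preference.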
Papers
Large Language Models are Inconsistent and Biased Evaluators
Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo