Meta-Evaluation
Meta-evaluation in the context of large language models (LLMs) assesses the reliability and effectiveness of the automated methods used to evaluate LLM outputs, which often rely on other LLMs acting as "judges." Current research focuses on building robust, unbiased automated evaluators, addressing problems such as bias toward longer responses, inconsistent performance across languages and tasks, and the need for finer-grained analysis of specific error types. This work matters because it determines how much trust can be placed in the automated judgments that guide LLM development and deployment, ultimately supporting more trustworthy and effective AI systems.
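In practice, the most common form of meta-evaluation is measuring how well an automatic metric or LLM judge agrees with human judgments of the same outputs, typically via correlation. Below is a minimal sketch of this idea; the scores are illustrative placeholders, not data from any of the listed papers.

```python
# Minimal meta-evaluation sketch: score an automatic metric (or LLM judge)
# by how well it correlates with human judgments on the same outputs.
# The numbers below are illustrative placeholders only.
from scipy.stats import pearsonr, kendalltau

# One entry per system output: a human quality rating and the metric's score.
human_scores  = [4.5, 3.0, 2.0, 5.0, 1.5, 3.5]
metric_scores = [0.82, 0.61, 0.35, 0.90, 0.40, 0.58]

# Pearson captures linear agreement; Kendall's tau captures ranking agreement,
# which is often what matters when a metric is used to compare systems.
pearson_r, _ = pearsonr(human_scores, metric_scores)
kendall_tau, _ = kendalltau(human_scores, metric_scores)

print(f"Pearson r:     {pearson_r:.3f}")
print(f"Kendall's tau: {kendall_tau:.3f}")
```

Higher correlation with human judgments is taken as evidence that the automatic evaluator is reliable; datasets such as those in the papers below provide the human annotations that make this comparison possible.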
Papers
IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation Metrics for Indian Languages
Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
Liang Ma, Shuyang Cao, Robert L. Logan, Di Lu, Shihao Ran, Ke Zhang, Joel Tetreault, Alejandro Jaimes