Automatic Evaluation
Automatic evaluation of generated text and other outputs from AI models, particularly large language models (LLMs), aims to create objective and efficient alternatives to expensive and time-consuming human assessment. Current research focuses on developing new metrics and frameworks that better correlate with human judgment, often leveraging LLMs themselves as "judges" or incorporating techniques like instruction tuning and preference optimization. These advancements are crucial for accelerating the development and deployment of AI systems across diverse fields, from scientific protocol generation to medical diagnosis and education, by providing reliable and scalable evaluation methods.
Papers
Less is More for Improving Automatic Evaluation of Factual Consistency
Tong Wang, Ninad Kulkarni, Yanjun Qi
FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, Shikun Zhang