Automatic Evaluation
Automatic evaluation of generated text and other outputs from AI models, particularly large language models (LLMs), aims to provide an objective, efficient alternative to expensive and time-consuming human assessment. Current research focuses on metrics and frameworks that correlate better with human judgment, often using LLMs themselves as "judges" or incorporating techniques such as instruction tuning and preference optimization. Reliable, scalable evaluation methods of this kind speed up the development and deployment of AI systems in fields ranging from scientific protocol generation to medical diagnosis and education.
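One common way to check how well an automatic metric tracks human judgment is to compute a rank correlation between the metric's scores and human ratings for the same outputs. The sketch below implements Spearman's rank correlation in plain Python. The function names and the sample scores are illustrative assumptions, not taken from any of the listed papers.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical data: one automatic score and one human rating per generated output.
metric_scores = [0.12, 0.45, 0.33, 0.80, 0.51]
human_ratings = [1, 3, 2, 5, 4]

print(f"Spearman correlation: {spearman(metric_scores, human_ratings):.3f}")
```

In this made-up example the metric orders the outputs exactly as the human raters do, so the correlation is close to 1. Real metrics typically score much lower, and correlations are usually reported per criterion (for example fluency or relevance) rather than as a single number.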
Papers
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
Cyril Chhun, Pierre Colombo, Chloé Clavel, Fabian M. Suchanek
Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced Transformer
Fengji Zhang, Jin Liu, Yao Wan, Xiao Yu, Xiao Liu, Jacky Keung