Human Evaluation
Human evaluation in artificial intelligence, particularly for large language models (LLMs), focuses on developing reliable and efficient methods for assessing model outputs against human expectations. Current research emphasizes standardized evaluation frameworks, often incorporating LLM-as-a-judge approaches to automate the process, while addressing biases and inconsistencies in both human and automated assessments; a minimal sketch of such a judge pipeline appears below. Robust evaluation is central to responsible AI development and deployment: it improves the trustworthiness and practical applicability of LLMs across domains from medical diagnosis to scientific synthesis by ensuring that systems align with human needs and values.
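To make the LLM-as-a-judge idea concrete, here is a minimal sketch of an automated judging loop and a simple agreement check against human ratings. The rubric wording, 1-5 scale, item fields, and the `judge` callable are all illustrative assumptions, not a prescription from any of the papers listed here; in practice the judge would wrap a call to an actual model.

```python
# Minimal sketch of an LLM-as-a-judge evaluation loop (hypothetical rubric,
# scoring scale, and judge interface; not tied to any specific model API).
from typing import Callable, List
import re
import statistics

RUBRIC = (
    "Rate the RESPONSE to the QUESTION on a 1-5 scale "
    "(1 = unacceptable, 5 = excellent). Reply with a single integer.\n\n"
    "QUESTION: {question}\nRESPONSE: {response}\nSCORE:"
)

def judge_score(judge: Callable[[str], str], question: str, response: str) -> int:
    """Ask the judge for a 1-5 score and parse the first digit in its reply."""
    reply = judge(RUBRIC.format(question=question, response=response))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from judge reply: {reply!r}")
    return int(match.group())

def mean_absolute_gap(judge: Callable[[str], str], items: List[dict]) -> float:
    """Compare automated scores with human ratings on the same items.

    Each item carries 'question', 'response', and a 'human' score; the returned
    mean gap is a rough proxy for judge/human agreement (lower is better).
    """
    gaps = [abs(judge_score(judge, it["question"], it["response"]) - it["human"])
            for it in items]
    return statistics.mean(gaps)

if __name__ == "__main__":
    # Stand-in judge for demonstration; a real pipeline would query an LLM here.
    dummy_judge = lambda prompt: "4"
    data = [{"question": "What is 2+2?", "response": "4", "human": 5}]
    print(mean_absolute_gap(dummy_judge, data))
```

Measuring the gap between automated and human scores on a shared item set is one simple way to quantify the biases and inconsistencies mentioned above; richer agreement statistics (e.g., correlation or chance-corrected agreement) are common alternatives.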
Papers
A Survey of AI-Generated Video Evaluation
Xiao Liu, Xinhao Xiang, Zizhong Li, Yongheng Wang, Zhuoheng Li, Zhuosheng Liu, Weidi Zhang, Weiqi Ye, Jiawei Zhang
Improving Model Factuality with Fine-grained Critique-based Evaluator
Yiqing Xie, Wenxuan Zhou, Pradyot Prakash, Di Jin, Yuning Mao, Quintin Fettes, Arya Talebzadeh, Sinong Wang, Han Fang, Carolyn Rose, Daniel Fried, Hejia Zhang
Level of agreement between emotions generated by Artificial Intelligence and human evaluation: a methodological proposal
Miguel Carrasco, Cesar Gonzalez-Martin, Sonia Navajas-Torrente, Raul Dastres
Multi-Facet Counterfactual Learning for Content Quality Evaluation
Jiasheng Zheng, Hongyu Lin, Boxi Cao, Meng Liao, Yaojie Lu, Xianpei Han, Le Sun