Human Evaluation
Human evaluation in artificial intelligence, particularly for large language models (LLMs), focuses on developing reliable and efficient methods for assessing model performance against human expectations. Current research emphasizes standardized evaluation frameworks, often incorporating LLM-as-a-judge approaches to automate the process while addressing biases and inconsistencies in both human and automated assessments. This work is crucial for improving the trustworthiness and practical applicability of LLMs across diverse domains, from medical diagnosis to scientific synthesis, by ensuring that AI systems align with human needs and values. Robust evaluation methods are therefore essential for responsible AI development and deployment.
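To make the LLM-as-a-judge idea concrete, the sketch below scores a model answer by prompting a judge model for a 1-5 rating and averaging repeated samples to smooth over run-to-run inconsistency in the judge. It is a minimal illustration, not the method of any paper listed here: the JudgeFn callable, the JUDGE_PROMPT template, and the helper names are assumptions standing in for whatever LLM backend and rubric a given evaluation framework actually uses.

```python
from statistics import mean
from typing import Callable, Optional

# Hypothetical judge backend: any callable mapping a prompt string to the
# judge model's text reply (e.g. a thin wrapper around an LLM API of your choice).
JudgeFn = Callable[[str], str]

# Assumed rubric prompt; real frameworks use task-specific criteria.
JUDGE_PROMPT = """You are grading a model's answer to a user question.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness on a scale of 1 (poor) to 5 (excellent).
Reply with the number only."""


def parse_score(reply: str) -> Optional[float]:
    """Extract the first integer 1-5 from the judge's reply, if any."""
    for token in reply.split():
        if token.strip(".,") in {"1", "2", "3", "4", "5"}:
            return float(token.strip(".,"))
    return None  # unparseable reply; the caller decides how to handle it


def judge_answer(judge: JudgeFn, question: str, answer: str,
                 n_samples: int = 3) -> Optional[float]:
    """Score one answer by sampling the judge several times and averaging."""
    scores = []
    for _ in range(n_samples):
        reply = judge(JUDGE_PROMPT.format(question=question, answer=answer))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else None


if __name__ == "__main__":
    # Stub judge for demonstration only; replace with a real model call.
    fake_judge = lambda prompt: "4"
    print(judge_answer(fake_judge, "What causes tides?", "The Moon's gravity."))
```

In practice, frameworks that compare two systems pairwise also present the candidate answers to the judge in both orders and average the verdicts, since judge models are known to exhibit position bias.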
Papers
Synthetic Dialogue Dataset Generation using LLM Agents
Yelaman Abdullin, Diego Molla-Aliod, Bahadorreza Ofoghi, John Yearwood, Qingyang Li
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu
Fusion-Eval: Integrating Assistant Evaluators with LLMs
Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng
StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving
Chang Gao, Haiyun Jiang, Deng Cai, Shuming Shi, Wai Lam
The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony and Sarcasm Generation
Tyler Loakman, Aaron Maladry, Chenghua Lin
TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs
Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin, Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang, Jing Nie, Yuhong Liu