Paper ID: 2405.02559

A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review

Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor, Alisa V Stolyar, Katelyn Polanska, Karleigh R McCarthy, Hunter Osterhoudt, Xizhi Wu, Shyam Visweswaran, Sunyang Fu, Piyush Mathur, Giovanni E. Cacciamani, Cong Sun, Yifan Peng, Yanshan Wang

With generative artificial intelligence (AI), particularly large language models (LLMs), continuing to make inroads in healthcare, it is critical to supplement traditional automated evaluations with human evaluations. Understanding and evaluating the output of LLMs is essential to assuring safety, reliability, and effectiveness. However, the cumbersome, time-consuming, and non-standardized nature of human evaluation presents significant obstacles to comprehensive evaluation and widespread adoption of LLMs in practice. This study reviews existing literature on human evaluation methodologies for LLMs in healthcare. We highlight a notable need for a standardized and consistent human evaluation approach. Our extensive literature search, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, includes publications from January 2018 to February 2024. The review examines the human evaluation of LLMs across various medical specialties, addressing factors such as evaluation dimensions, sample types and sizes, selection and recruitment of evaluators, frameworks and metrics, evaluation process, and type of statistical analysis. Drawing on the diverse evaluation strategies employed in these studies, we propose a comprehensive and practical framework for human evaluation of LLMs, QUEST: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. This framework aims to improve the reliability, generalizability, and applicability of human evaluation of LLMs in different healthcare applications by defining clear evaluation dimensions and offering detailed guidelines.

Submitted: May 4, 2024