Human Evaluation

Human evaluation in the field of artificial intelligence, particularly concerning large language models (LLMs), focuses on developing reliable and efficient methods to assess model performance against human expectations. Current research emphasizes creating standardized evaluation frameworks, often incorporating LLM-as-a-judge approaches to automate the process, while simultaneously addressing biases and inconsistencies in both human and automated assessments. This work is crucial for improving the trustworthiness and practical applicability of LLMs across diverse domains, from medical diagnosis to scientific synthesis, by ensuring that AI systems align with human needs and values. The development of robust evaluation methods is essential for responsible AI development and deployment.

Papers

September 19, 2023

What is the Best Automated Metric for Text to Motion Generation?
Jordan Voas, Yili Wang, Qixing Huang, Raymond Mooney
Text Modality Evaluation Metric Human Evaluation Motion Generation Automatic Metric Skeleton Based Human Motion Effective Generative

September 14, 2023

Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation
Sarah E. Finch, James D. Finch, Jinho D. Choi
Global Impact Human Evaluation Dialogue Evaluation Dialogue Quality

September 8, 2023

Human evaluation of robotic grippers for berry picking
Laura Alvarez-Hidalgo, Ian S. Howard
Human Evaluation Robotic Gripper Berry Picking Hedge Detection

September 6, 2023

Rubric-Specific Approach to Automated Essay Scoring with Augmentation Training
Brian Cho, Youngbin Jang, Jaewoong Yoon
Human Evaluation Automatic Evaluation Essay Scoring Subjective Evaluation Rubric Based

August 14, 2023

August 12, 2023

With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector
Ondřej Plátek, Mateusz Lango, Ondřej Dušek
Machine Translation Human Annotation Human Evaluation Research Reproducibility Author Name Little Help

August 11, 2023

Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters
Arne Bewersdorff, Kathrin Seßler, Armin Baur, Enkelejda Kasneci, Claudia Nerdel
Artificial Intelligence Artificial Intelligence System Human Evaluation Error Detection Student Error

August 10, 2023

Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation
Hao Liang, Pietro Perona, Guha Balakrishnan
Face Recognition Human Evaluation Synthetic Face Algorithmic Bias Perceptual Similarity Face Generator Experimental Approach

July 6, 2023

Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose?
Luísa Shimabucoro, Timothy Hospedales, Henry Gouk
Human Evaluation Model Selection Cross Validation Performance Estimation Human Aligned Benchmark

June 23, 2023

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
Global Evaluation Human Evaluation Language Model Behavior

June 13, 2023

Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms
Nari Johnson, Ángel Alexander Cabrera, Gregory Plumb, Ameet Talwalkar
Machine Learning Human Evaluation Model Accuracy Slice Discovery

June 12, 2023

Evolving Testing Scenario Generation Method and Intelligence Evaluation Framework for Automated Vehicles
Yining Ma, Wei Jiang, Lingtong Zhang, Junyi Chen, Hong Wang, Chen Lv, Xuesong Wang, Lu Xiong
Deep Reinforcement Learning Automated Driving Human Evaluation Naturalistic Driving Cooperative Driving Scenario Generation Scenario Based Testing Driving Policy

May 29, 2023

Large Language Models are not Fair Evaluators
Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui
Large Language Model Human Evaluation Bias Evaluation

May 24, 2023

May 23, 2023

Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations
Lucy Lu Wang, Yulia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey E. Kuehl, Erin Bransom, Byron C. Wallace
Human Evaluation Multi Document Summarization Summarization Quality Automatic Metric Summarization Evaluation Medical Text Summarization Edit Summary

May 22, 2023

Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization
Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, Lidong Bing
Large Language Model Abstractive Summarization Human Evaluation Abstractive Summary

May 15, 2023

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist
Iftitahu Ni'mah, Meng Fang, Vlado Menkovski, Mykola Pechenizkiy
Language Generation Human Evaluation Natural Language Generation Preference Rating Agnostic Metric Correlation Analysis

Human Evaluation

Papers

What is the Best Automated Metric for Text to Motion Generation?

Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation

Human evaluation of robotic grippers for berry picking

Rubric-Specific Approach to Automated Essay Scoring with Augmentation Training

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation

With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector

Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters

Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation

Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose?

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms

Evolving Testing Scenario Generation Method and Intelligence Evaluation Framework for Automated Vehicles

Large Language Models are not Fair Evaluators

Learning Answer Generation using Supervision from Automatic Question Answering Evaluators

PLCMOS -- a data-driven non-intrusive metric for the evaluation of packet loss concealment algorithms

DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4

Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations

Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist