Human Evaluation
Human evaluation in artificial intelligence, particularly for large language models (LLMs), focuses on developing reliable and efficient methods for assessing model outputs against human judgment. Current research emphasizes standardized evaluation frameworks, often incorporating LLM-as-a-judge approaches to automate assessment at scale while addressing the biases and inconsistencies that affect both human raters and automated judges. Such work is essential for responsible AI development and deployment: it underpins the trustworthiness and practical applicability of LLMs across domains from medical diagnosis to scientific synthesis by ensuring that model behavior aligns with human needs and values.
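To make the LLM-as-a-judge idea concrete, here is a minimal Python sketch of a pairwise judge that mitigates one well-known automated-assessment bias, position bias, by querying twice with the answer order swapped and accepting only verdicts that are consistent under both orderings. The prompt template, the `call_judge` placeholder, and the tie-on-disagreement rule are illustrative assumptions, not a method from any of the papers listed below; substitute a real LLM API call for `call_judge`.

```python
# Minimal LLM-as-a-judge sketch with position-bias mitigation.
# NOTE: `call_judge` is a hypothetical placeholder, not a real API;
# replace it with a call to your LLM provider's client.

JUDGE_PROMPT = """You are an impartial evaluator. Given a question and two
candidate answers, reply with exactly "A", "B", or "TIE".

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Verdict:"""


def call_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; always abstains here."""
    return "TIE"


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with the answer order swapped to control position bias.

    Returns "1", "2", or "TIE", referring to the original answer labels.
    """
    first = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_1, answer_b=answer_2)).strip()
    second = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_2, answer_b=answer_1)).strip()

    # Map the swapped run's verdict back onto the first run's labels.
    swapped = {"A": "B", "B": "A", "TIE": "TIE"}.get(second, "TIE")
    if first == swapped:  # verdict is consistent under both orderings
        return {"A": "1", "B": "2", "TIE": "TIE"}.get(first, "TIE")
    return "TIE"  # inconsistent verdicts are conservatively scored as a tie


if __name__ == "__main__":
    print(judge_pair("What is 2 + 2?", "4", "five"))
```

The order-swapping step matters because LLM judges have been observed to favor whichever answer appears first; counting inconsistent verdicts as ties trades some decisiveness for reliability.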
Papers
ProtSi: Prototypical Siamese Network with Data Augmentation for Few-Shot Subjective Answer Evaluation
Yining Lu, Jingxi Qiu, Gaurav Gupta
Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation
Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Anya Belz, Ehud Reiter