Evaluation Method
Evaluating the performance of increasingly complex AI models, particularly large language models (LLMs) and other generative AI systems, is a critical and evolving field of research. Current efforts focus on developing more robust and comprehensive evaluation methods that move beyond simple accuracy metrics, incorporating human judgment, system-centric and user-centric factors, and addressing biases and limitations in existing benchmarks. These improved evaluation techniques are essential for ensuring the reliability, fairness, and responsible deployment of AI systems across diverse applications, ultimately shaping the future of AI development and its societal impact.
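As a minimal illustration of what "moving beyond simple accuracy metrics" can mean in practice, the sketch below contrasts exact-match accuracy with a win rate aggregated from pairwise human preference judgments. The data, function names, and voting scheme are hypothetical and shown only for explanation; they are not taken from any specific paper listed on this page.

```python
# Illustrative sketch only: contrasts a simple accuracy metric with a
# pairwise human-preference win rate. All data below is made up for
# demonstration purposes.
from collections import Counter

def accuracy(predictions, references):
    """Fraction of exact matches -- the 'simple accuracy' baseline."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def pairwise_win_rate(judgments, model="A"):
    """Aggregate human pairwise judgments ('A', 'B', or 'tie') into a
    win rate for the given model, counting ties as half a win."""
    counts = Counter(judgments)
    wins = counts[model] + 0.5 * counts["tie"]
    return wins / len(judgments)

if __name__ == "__main__":
    preds = ["Paris", "4", "blue"]
    refs = ["Paris", "5", "blue"]
    print(f"accuracy:   {accuracy(preds, refs):.2f}")

    # Hypothetical human preferences between model A and model B.
    human_votes = ["A", "A", "tie", "B", "A"]
    print(f"A win rate: {pairwise_win_rate(human_votes):.2f}")
```

A preference-based aggregate like this captures qualities (helpfulness, tone, factual grounding as judged by people) that exact-match accuracy cannot, which is why human judgment features prominently in current evaluation research.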