Global Evaluation
Global evaluation across scientific domains focuses on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, evolving data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, applied to diverse model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems across applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
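To make the Item Response Theory idea concrete, the snippet below is a minimal, self-contained sketch (not taken from any of the listed papers; the item parameters, function names, and response pattern are illustrative assumptions) of estimating a model's latent "ability" from per-item correctness under the two-parameter logistic model, rather than reporting raw accuracy alone.

```python
# Minimal sketch of IRT-based model evaluation (illustrative parameters only):
# each benchmark item has a difficulty b and discrimination a, and a model's
# "ability" theta is estimated from its correct/incorrect pattern via a simple
# grid-search maximum-likelihood fit.
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta: float, responses, items) -> float:
    """Log-likelihood of a response pattern given ability theta."""
    ll = 0.0
    for correct, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        ll += math.log(p) if correct else math.log(1.0 - p)
    return ll


def estimate_ability(responses, items) -> float:
    """Grid-search MLE of ability over theta in [-4, 4]; fine for illustration."""
    grid = [x / 100.0 for x in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


if __name__ == "__main__":
    # Hypothetical items: (discrimination a, difficulty b)
    items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5), (2.0, 2.0)]
    # One model's graded answers on those items (True = correct)
    responses = [True, True, True, False, False]
    print(f"Estimated model ability: {estimate_ability(responses, items):.2f}")
```

Unlike raw accuracy, the estimated ability weighs items by their difficulty and discrimination, which is one way evaluation frameworks move beyond simple accuracy when comparing models on heterogeneous benchmarks.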
Papers
GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models
Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han
Enhancing Role-playing Systems through Aggressive Queries: Evaluation and Improvement
Yihong Tang, Jiao Ou, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao
Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model
Salman Rahman, Lavender Yao Jiang, Saadia Gabriel, Yindalon Aphinyanaphongs, Eric Karl Oermann, Rumi Chunara
Distractor Generation for Multiple-Choice Questions: A Survey of Methods, Datasets, and Evaluation
Elaf Alhazmi, Quan Z. Sheng, Wei Emma Zhang, Munazza Zaib, Ahoud Alhazmi
Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models
Haochun Wang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, Ting Liu
Evaluation of Google's Voice Recognition and Sentence Classification for Health Care Applications
Majbah Uddin, Nathan Huynh, Jose M Vidal, Kevin M Taaffe, Lawrence D Fredendall, Joel S Greenstein
Optical Tactile Sensing for Aerial Multi-Contact Interaction: Design, Integration, and Evaluation
Emanuele Aucone, Carmelo Sferrazza, Manuel Gregor, Raffaello D'Andrea, Stefano Mintchev
Evaluation in Neural Style Transfer: A Review
Eleftherios Ioannou, Steve Maddock
Evaluation of Out-of-Distribution Detection Performance on Autonomous Driving Datasets
Jens Henriksson, Christian Berger, Stig Ursing, Markus Borg
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu