Global Evaluation
Global evaluation across scientific domains focuses on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, shifting data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, applied to diverse model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
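To make one of the techniques above concrete, the minimal sketch below shows how a two-parameter logistic (2PL) Item Response Theory model scores the probability that a system answers a benchmark item correctly as a function of its latent ability and the item's difficulty and discrimination. The parameter values and the framing of models as "respondents" are illustrative assumptions; evaluation frameworks that use IRT fit these parameters from observed response matrices rather than setting them by hand.

import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    # Two-parameter logistic IRT: probability that a respondent
    # (here, a model under evaluation) with ability `theta` answers
    # an item correctly, given discrimination `a` and difficulty `b`.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical example: two benchmark items of differing difficulty.
# In practice, a and b would be estimated from a response matrix.
for theta in (-1.0, 0.0, 1.0):               # latent "ability" levels
    easy = irt_2pl(theta, a=1.2, b=-0.5)     # easy, moderately discriminating
    hard = irt_2pl(theta, a=2.0, b=1.0)      # hard, highly discriminating
    print(f"ability={theta:+.1f}  P(easy)={easy:.2f}  P(hard)={hard:.2f}")

Under this view, two models with the same overall accuracy can be distinguished by which items they solve: success on high-difficulty, high-discrimination items moves the ability estimate far more than success on easy ones, which is why IRT-based benchmarks can rank systems more informatively than raw accuracy.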
Papers
Evaluation is all you need. Prompting Generative Large Language Models for Annotation Tasks in the Social Sciences. A Primer using Open Models
Maximilian Weber, Merle Reichardt
An $\ell^1$-Plug-and-Play Approach for MPI Using a Zero Shot Denoiser with Evaluation on the 3D Open MPI Dataset
Vladyslav Gapyak, Corinna Rentschler, Thomas März, Andreas Weinmann
How to Evaluate Coreference in Literary Texts?
Ana-Isabel Duron-Tejedor, Pascal Amsili, Thierry Poibeau
How much can change in a year? Revisiting Evaluation in Multi-Agent Reinforcement Learning
Siddarth Singh, Omayma Mahjoub, Ruan de Kock, Wiem Khlifi, Abidine Vall, Kale-ab Tessera, Arnu Pretorius
Enhancing Robotic Navigation: An Evaluation of Single and Multi-Objective Reinforcement Learning Strategies
Vicki Young, Jumman Hossain, Nirmalya Roy
PromptBench: A Unified Library for Evaluation of Large Language Models
Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie
Evaluation of Infrastructure-based Warning System on Driving Behaviors-A Roundabout Study
Cong Zhang, Chi Tian, Tianfang Han, Hang Li, Yiheng Feng, Yunfeng Chen, Robert W. Proctor, Jiansong Zhang
Evaluation of Active Feature Acquisition Methods for Static Feature Settings
Henrik von Kleist, Alireza Zamanian, Ilya Shpitser, Narges Ahmidi
Beyond Accuracy: Statistical Measures and Benchmark for Evaluation of Representation from Self-Supervised Learning
Jiantao Wu, Shentong Mo, Sara Atito, Josef Kittler, Zhenhua Feng, Muhammad Awais
Kattis vs. ChatGPT: Assessment and Evaluation of Programming Tasks in the Age of Artificial Intelligence
Nora Dunder, Saga Lundborg, Olga Viberg, Jacqueline Wong