Global Evaluation
Global evaluation across scientific domains focuses on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, evolving data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, and that span diverse model architectures, including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems across applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
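As context for the Item Response Theory reference above, the sketch below is an illustration only (it is not drawn from any of the listed papers): it fits a one-parameter logistic (Rasch) model by plain NumPy gradient ascent, assuming a hypothetical binary matrix of per-item correctness for several models, and jointly estimates a latent ability per model and a difficulty per benchmark item. All function and variable names are illustrative.

import numpy as np

def fit_rasch(responses, lr=0.05, n_iters=2000):
    """Fit a 1PL (Rasch) IRT model by gradient ascent on the log-likelihood.

    responses: binary array of shape (n_models, n_items), where entry (m, i)
    is 1 if model m answered benchmark item i correctly, else 0.
    Returns (abilities, difficulties). Illustrative sketch, not a library API.
    """
    n_models, n_items = responses.shape
    abilities = np.zeros(n_models)      # theta_m, one latent ability per model
    difficulties = np.zeros(n_items)    # b_i, one difficulty per item

    for _ in range(n_iters):
        logits = abilities[:, None] - difficulties[None, :]
        probs = 1.0 / (1.0 + np.exp(-logits))   # P(correct | theta_m, b_i)
        residual = responses - probs            # gradient of the log-likelihood
        abilities += lr * residual.sum(axis=1)
        difficulties -= lr * residual.sum(axis=0)
        difficulties -= difficulties.mean()     # center difficulties for identifiability
    return abilities, difficulties

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic correctness matrix: 5 models x 40 items of varying difficulty.
    true_theta = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
    true_b = rng.normal(0.0, 1.0, size=40)
    p = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
    data = (rng.random(p.shape) < p).astype(float)

    theta_hat, b_hat = fit_rasch(data)
    print("Estimated model abilities:", np.round(theta_hat, 2))

The appeal of this kind of model for benchmark evaluation is that the estimated abilities and item difficulties live on a shared scale, which makes comparisons less sensitive to the particular mix of easy and hard items than raw accuracy.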
Papers
A Hybrid Real-Time Framework for Efficient Fussell-Vesely Importance Evaluation Using Virtual Fault Trees and Graph Neural Networks
Xingyu Xiao, Peng Chen
Virtualization & Microservice Architecture for Software-Defined Vehicles: An Evaluation and Exploration
Long Wen, Markus Rickert, Fengjunjie Pan, Jianjie Lin, Yu Zhang, Tobias Betz, Alois Knoll
Frechet Music Distance: A Metric For Generative Symbolic Music Evaluation
Jan Retkowski, Jakub Stępniak, Mateusz Modrzejewski
CAP: Evaluation of Persuasive and Creative Image Generation
Aysan Aghazadeh, Adriana Kovashka
Benchmark for Evaluation and Analysis of Citation Recommendation Models
Puja Maharjan
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim, Juyoung Suk, Seungone Kim, Niklas Muennighoff, Dongkwan Kim, Alice Oh
Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation
Lorenzo Cima, Alessio Miaschi, Amaury Trujillo, Marco Avvenuti, Felice Dell'Orletta, Stefano Cresci
Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications
Raphael Shu, Nilaksh Das, Michelle Yuan, Monica Sunkara, Yi Zhang
C²LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation
Yanyang Li, Tin Long Wong, Cheung To Hung, Jianqiao Zhao, Duo Zheng, Ka Wai Liu, Michael R. Lyu, Liwei Wang
Assessing Similarity Measures for the Evaluation of Human-Robot Motion Correspondence
Charles Dietzel, Patrick J. Martin
Good practices for evaluation of machine learning systems
Luciana Ferrer, Odette Scharenborg, Tom Bäckström
WiS Platform: Enhancing Evaluation of LLM-Based Multi-Agent Systems Through Game-Based Analysis
Chengwei Hu, Jianhui Zheng, Yancheng He, Hangyu Guo, Junguang Jiang, Han Zhu, Kai Sun, Yuning Jiang, Wenbo Su, Bo Zheng
Human-centred test and evaluation of military AI
David Helmer, Michael Boardman, S. Kate Conroy, Adam J. Hepworth, Manoj Harjani
SiTSE: Sinhala Text Simplification Dataset and Evaluation
Surangika Ranathunga, Rumesh Sirithunga, Himashi Rathnayake, Lahiru De Silva, Thamindu Aluthwala, Saman Peramuna, Ravi Shekhar