Global Evaluation
Global evaluation in various scientific domains focuses on developing robust and reliable methods for assessing the performance of models and systems, often addressing challenges in data diversity, evolving data distributions, and the need for human-centered metrics. Current research emphasizes the development of comprehensive benchmarks and evaluation frameworks, often incorporating techniques like Item Response Theory and multi-faceted metrics beyond simple accuracy, and utilizing diverse model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advancements are crucial for ensuring the trustworthiness and effectiveness of AI systems across diverse applications, from medical diagnosis to autonomous driving, and for fostering reproducible and comparable research within the scientific community.
Papers
Depth $F_1$: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability
Parker Seegmiller, Joseph Gatto, Sarah Masud Preum
EasyECR: A Library for Easy Implementation and Evaluation of Event Coreference Resolution Models
Yuncong Li, Tianhua Xu, Sheng-hua Zhong, Haiqin Yang
SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation
Xiaoze Liu, Ting Sun, Tianyang Xu, Feijie Wu, Cunxiang Wang, Xiaoqian Wang, Jing Gao
Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review
Debalina Ghosh Paul, Hong Zhu, Ian Bayley
Improving the Evaluation and Actionability of Explanation Methods for Multivariate Time Series Classification
Davide Italo Serramazza, Thach Le Nguyen, Georgiana Ifrim
Automatic benchmarking of large multimodal models via iterative experiment programming
Alessandro Conti, Enrico Fini, Paolo Rota, Yiming Wang, Massimiliano Mancini, Elisa Ricci
A Benchmark for Maximum Cut: Towards Standardization of the Evaluation of Learned Heuristics for Combinatorial Optimization
Ankur Nath, Alan Kuhnle
Evaluation of Large Language Models: STEM education and Gender Stereotypes
Smilla Due, Sneha Das, Marianne Andersen, Berta Plandolit López, Sniff Andersen Nexø, Line Clemmensen
On the Evaluation of Speech Foundation Models for Spoken Language Understanding
Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe
Rethinking the Evaluation of Out-of-Distribution Detection: A Sorites Paradox
Xingming Long, Jie Zhang, Shiguang Shan, Xilin Chen
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang
What is the best model? Application-driven Evaluation for Large Language Models
Shiguo Lian, Kaikai Zhao, Xinhui Liu, Xuejiao Lei, Bikun Yang, Wenjing Zhang, Kai Wang, Zhaoxiang Liu
A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations
Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang
Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation
Claude Formanek, Callum Rhys Tilbury, Louise Beyers, Jonathan Shock, Arnu Pretorius
Word Order in English-Japanese Simultaneous Interpretation: Analyses and Evaluation using Chunk-wise Monotonic Translation
Kosuke Doi, Yuka Ko, Mana Makinae, Katsuhito Sudoh, Satoshi Nakamura