Global Evaluation
Global evaluation across scientific domains focuses on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, shifting data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, applied to diverse model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
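To make the Item Response Theory mention concrete: in IRT-based benchmark evaluation, each test item gets a discrimination and a difficulty parameter, and a model's latent ability is estimated from its pattern of correct and incorrect answers rather than from raw accuracy alone. The sketch below is purely illustrative (the function names, item parameters, and grid-search estimator are not from any of the listed papers); it implements the standard two-parameter logistic (2PL) model with a simple maximum-likelihood ability estimate.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate via a coarse grid search.
    responses: list of 0/1 outcomes; items: list of (a, b) pairs."""
    if grid is None:
        grid = [i / 100.0 for i in range(-400, 401)]  # theta in [-4, 4]

    def log_lik(theta):
        ll = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if r else math.log(1.0 - p)
        return ll

    return max(grid, key=log_lik)

# Hypothetical benchmark: four items of increasing difficulty b,
# each with its own discrimination a.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 2.0)]
responses = [1, 1, 1, 0]  # correct on the three easier items only
theta_hat = estimate_ability(responses, items)
```

Two models with identical accuracy can receive different ability estimates under this scheme, because solving highly discriminative or difficult items carries more evidential weight than solving easy ones; this is one reason the surveyed work favors such metrics over plain accuracy.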
Papers
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He
VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
Can GNNs Learn Link Heuristics? A Concise Review and Evaluation of Link Prediction Methods
Shuming Liang, Yu Ding, Zhidong Li, Bin Liang, Siqi Zhang, Yang Wang, Fang Chen
Creation and Evaluation of a Food Product Image Dataset for Product Property Extraction
Christoph Brosch, Alexander Bouwens, Sebastian Bast, Swen Haab, Rolf Krieger
Scaling up the Evaluation of Collaborative Problem Solving: Promises and Challenges of Coding Chat Data with ChatGPT
Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi, Lei Liu, Michael Flor
Rethinking CyberSecEval: An LLM-Aided Approach to Evaluation Critique
Suhas Hariharan, Zainab Ali Majid, Jaime Raldua Veuthey, Jacob Haimes
Optimizing Automatic Summarization of Long Clinical Records Using Dynamic Context Extension: Testing and Evaluation of the NBCE Method
Guoqing Zhang, Keita Fukuyama, Kazumasa Kishimoto, Tomohiro Kuroda
An Axiomatic Study of the Evaluation of Enthymeme Decoding in Weighted Structured Argumentation
Jonathan Ben-Naim, Victor David, Anthony Hunter
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju
Measuring short-form factuality in large language models
Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus
Evaluation of handwriting kinematics and pressure for differential diagnosis of Parkinson's disease
Peter Drotár, Jiří Mekyska, Irena Rektorová, Lucia Masarová, Zdeněk Smékal, Marcos Faundez-Zanuy
Conditional Vendi Score: An Information-Theoretic Approach to Diversity Evaluation of Prompt-based Generative Models
Mohammad Jalali, Azim Ospanov, Amin Gohari, Farzan Farnia