Global Evaluation
Global evaluation across scientific domains centers on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, evolving data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, and that span diverse model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
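The overview above mentions Item Response Theory (IRT) as one technique for going beyond plain accuracy when comparing models on a benchmark. The sketch below is a minimal, hypothetical illustration of the two-parameter logistic (2PL) IRT model applied to a toy model-versus-item response matrix; the data, parameter names, and gradient-ascent loop are assumptions made for illustration, not the implementation used by any of the papers listed here.

```python
import numpy as np

def irt_2pl_prob(theta, a, b):
    """2PL IRT model: probability that a model with latent ability `theta`
    answers an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical benchmark results: rows = models, columns = items (1 = correct).
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
])

n_models, n_items = responses.shape
theta = np.zeros(n_models)   # latent model abilities
a = np.ones(n_items)         # item discriminations
b = np.zeros(n_items)        # item difficulties
lr = 0.1

# Toy joint estimation by gradient ascent on the Bernoulli log-likelihood
# (illustrative only; production IRT toolkits use EM or Bayesian inference).
for _ in range(500):
    p = irt_2pl_prob(theta[:, None], a[None, :], b[None, :])
    err = responses - p  # gradient of the log-likelihood w.r.t. the logits
    theta += lr * (err * a[None, :]).sum(axis=1) / n_items
    b     -= lr * (err * a[None, :]).sum(axis=0) / n_models
    a     += lr * (err * (theta[:, None] - b[None, :])).sum(axis=0) / n_models

print("estimated abilities:   ", np.round(theta, 2))
print("estimated difficulties:", np.round(b, 2))
```

Under this kind of model, a benchmark score reflects both which items a model solved and how difficult and discriminative those items are, which is one way evaluation frameworks weight items rather than averaging raw accuracy.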
Papers
Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies
Lachlan McGinness, Peter Baumgartner
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu
Evaluation of RAG Metrics for Question Answering in the Telecom Domain
Sujoy Roychowdhury, Sumit Soman, H G Ranjani, Neeraj Gunda, Vansh Chhabra, Sai Krishna Bala
An evaluation of CNN models and data augmentation techniques in hierarchical localization of mobile robots
J. J. Cabrera, O. J. Céspedes, S. Cebollada, O. Reinoso, L. Payá
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Yifan Song, Guoyin Wang, Sujian Li, Bill Yuchen Lin
Advancements in Recommender Systems: A Comprehensive Analysis Based on Data, Algorithms, and Evaluation
Xin Ma, Mingyue Li, Xuguang Liu
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models
Jin Liu, Qingquan Li, Wenlong Du
Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
Krishnaram Kenthapadi, Mehrnoosh Sameki, Ankur Taly
Automated Peer Reviewing in Paper SEA: Standardization, Evaluation, and Analysis
Jianxiang Yu, Zichen Ding, Jiaqi Tan, Kangyang Luo, Zhenmin Weng, Chenghua Gong, Long Zeng, Renjing Cui, Chengcheng Han, Qiushi Sun, Zhiyong Wu, Yunshi Lan, Xiang Li
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, Limin Wang