Evaluation Benchmarks
Evaluation benchmarks are crucial for assessing the performance of large language models (LLMs) and other AI systems across diverse tasks: they provide objective measures of capability and identify areas for improvement. Current research focuses on developing comprehensive benchmarks that address challenges such as data contamination, bias, and the evaluation of specific model functionalities (e.g., tool use, image generation, and video generation), often introducing novel metrics and datasets. These benchmarks are vital for fostering reproducible research, enabling fair comparisons between models, and ultimately driving the development of more robust and reliable AI systems for real-world applications.
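To make "objective measure" concrete, the sketch below shows the simplest form a benchmark evaluation can take: an exact-match accuracy computed over a held-out set of prompt-answer pairs. The dataset, the generate callable, and the metric choice are illustrative assumptions for this sketch only; they are not drawn from any of the benchmarks or papers listed below, which typically use far larger curated datasets and task-specific metrics.

    # Minimal sketch of how a benchmark turns model outputs into an objective score.
    # The toy dataset and the `generate` callable are illustrative placeholders,
    # not taken from any of the papers listed below.
    from typing import Callable, Dict, List


    def exact_match_accuracy(
        generate: Callable[[str], str],
        examples: List[Dict[str, str]],
    ) -> float:
        """Score a model by the fraction of prompts it answers exactly correctly."""
        correct = 0
        for ex in examples:
            prediction = generate(ex["prompt"]).strip().lower()
            if prediction == ex["answer"].strip().lower():
                correct += 1
        return correct / len(examples)


    if __name__ == "__main__":
        # Toy benchmark split; real benchmarks use thousands of curated examples
        # and often report several metrics per task.
        toy_benchmark = [
            {"prompt": "What is 2 + 2?", "answer": "4"},
            {"prompt": "Capital of France?", "answer": "Paris"},
        ]

        # Stand-in for an actual model call (e.g., an API or local inference).
        def dummy_model(prompt: str) -> str:
            return "4" if "2 + 2" in prompt else "Paris"

        print(f"Accuracy: {exact_match_accuracy(dummy_model, toy_benchmark):.2f}")

Reporting a single scalar per model on a fixed dataset is what enables the fair, reproducible comparisons described above; the papers below extend this basic recipe to generative image and video models and to assessing the quality of benchmarks themselves.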
Papers
EvalGIM: A Library for Evaluating Generative Image Models
Melissa Hall, Oscar Mañas, Reyhane Askari, Mark Ibrahim, Candace Ross, Pietro Astolfi, Tariq Berrada Ifriqi, Marton Havasi, Yohann Benchetrit, Karen Ullrich, Carolina Braga, Abhishek Charnalia, Maeve Ryan, Mike Rabbat, Michal Drozdzal, Jakob Verbeek, Adriana Romero Soriano
Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics
Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Francesco Croce
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, Mykel J. Kochenderfer