MT Bench
MT Bench and the related benchmark suites gathered here aim to rigorously evaluate the capabilities of large language and multimodal models across diverse tasks, including visual perception, video quality assessment, and dialogue generation. Current research focuses on building standardized benchmarks with fine-grained difficulty annotations and realistic validation procedures, often employing large language models themselves as evaluators (the "LLM-as-judge" approach). These benchmarks are crucial for identifying the limitations of current models and for guiding the development of more robust and reliable AI systems with improved generalization and safety, with impact on fields ranging from healthcare to scientific research.
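As a rough illustration of the LLM-as-judge evaluation loop that many of these benchmarks rely on, the sketch below scores model answers by prompting a judge model and parsing a numeric rating. The `judge` callable, the prompt template, and the `Rating: [[N]]` reply format are illustrative assumptions for this sketch, not the protocol of any specific benchmark listed here.

```python
import re
from typing import Callable, List, Optional

# Hypothetical judge callable: takes a prompt string, returns the judge model's reply.
JudgeFn = Callable[[str], str]

JUDGE_TEMPLATE = (
    "You are an impartial evaluator. Rate the assistant's answer to the question "
    "on a scale of 1-10 and reply in the form 'Rating: [[N]]'.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def score_response(question: str, answer: str, judge: JudgeFn) -> Optional[int]:
    """Ask the judge model to grade one response and parse the numeric rating."""
    reply = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+)\]\]", reply)
    return int(match.group(1)) if match else None

def evaluate(samples: List[dict], judge: JudgeFn) -> float:
    """Average the judge's ratings over a benchmark split, skipping unparsable replies."""
    ratings = []
    for sample in samples:
        rating = score_response(sample["question"], sample["answer"], judge)
        if rating is not None:
            ratings.append(rating)
    return sum(ratings) / len(ratings) if ratings else 0.0

if __name__ == "__main__":
    # Stub judge for illustration only; a real run would call an actual LLM here.
    stub_judge = lambda prompt: "Rating: [[7]]"
    data = [{"question": "What is 2 + 2?", "answer": "4"}]
    print(evaluate(data, stub_judge))  # -> 7.0
```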
Papers
μ-Bench: A Vision-Language Benchmark for Microscopy Understanding
Alejandro Lozano, Jeffrey Nirschl, James Burgess, Sanket Rajan Gupte, Yuhui Zhang, Alyssa Unell, Serena Yeung-Levy
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang, Jiaheng Liu, Ruibo Liu, Wenhao Huang, Ge Zhang, Shiwen Ni
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei
DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime
Zhiyao Luo, Mingcheng Zhu, Fenglin Liu, Jiali Li, Yangchen Pan, Jiandong Zhou, Tingting Zhu
C³Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models
Jiahuan Cao, Yongxin Shi, Dezhi Peng, Yang Liu, Lianwen Jin