MT Bench
MT Bench and related benchmarks aim to rigorously evaluate the capabilities of large language and multimodal models across diverse tasks, including visual perception, video quality assessment, and dialogue generation. Current research focuses on developing standardized benchmarks with fine-grained difficulty annotations and realistic validation procedures, often employing large language models themselves as judges. These benchmarks are crucial for identifying the limitations of current models and for guiding the development of more robust and reliable AI systems with improved generalization and safety, with impact across fields from healthcare to scientific research.
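To make the "LLM-as-judge" evaluation pattern concrete, here is a minimal sketch of how a judge model can score multi-turn answers on a 1-10 scale, in the spirit of MT Bench. The prompt wording, function names (`score_dialogue`, `judge`), and the stub judge are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of LLM-as-judge scoring (illustrative, not the official MT Bench code).
# A judge model rates each answer in a dialogue on a 1-10 scale; the judge backend
# is abstracted as a callable so any model API can be plugged in.

from statistics import mean
from typing import Callable, Dict, List

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the user's "
    "question on a scale of 1 to 10. Reply with the rating only.\n\n"
    "Question: {question}\nAnswer: {answer}\nRating:"
)

def score_dialogue(turns: List[Dict[str, str]],
                   judge: Callable[[str], str]) -> float:
    """Average the judge's 1-10 rating over every turn of one dialogue.

    `turns` is a list of {"question": ..., "answer": ...} dicts;
    `judge` maps a prompt string to the judge model's raw text reply.
    """
    ratings = []
    for turn in turns:
        prompt = JUDGE_PROMPT.format(**turn)
        reply = judge(prompt).strip()
        try:
            rating = float(reply.split()[0])  # take the first token as the numeric rating
        except (ValueError, IndexError):
            continue  # skip unparseable judgments
        ratings.append(max(1.0, min(10.0, rating)))  # clamp to the 1-10 scale
    return mean(ratings) if ratings else 0.0

if __name__ == "__main__":
    # Stub judge for demonstration; swap in a real model call in practice.
    demo = [{"question": "What is 2 + 2?", "answer": "4"}]
    print(score_dialogue(demo, judge=lambda prompt: "9"))
```

In practice the judge callable would wrap an actual model API, and scores are typically averaged over many dialogues per category to produce a leaderboard-style summary.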
Papers
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan
HS3-Bench: A Benchmark and Strong Baseline for Hyperspectral Semantic Segmentation in Driving Scenarios
Nick Theisen, Robin Bartsch, Dietrich Paulus, Peer Neubert