MT Bench
MT Bench and related benchmarks aim to rigorously evaluate the capabilities of large language and multimodal models across diverse tasks, including visual perception, video quality assessment, and multi-turn dialogue generation. Current research focuses on developing standardized benchmarks with fine-grained difficulty annotations and realistic validation procedures, often using large language models themselves as judges (see the sketch below). These benchmarks help identify the limitations of current models and guide the development of more robust and reliable AI systems with improved generalization and safety, with impact in fields ranging from healthcare to scientific research.
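As a rough illustration of the LLM-as-judge pattern used by MT Bench-style evaluations, the Python sketch below scores (question, answer) pairs with a pluggable judge model. The prompt wording, the 1-10 scale, and the `call_judge` / `fake_judge` names are illustrative assumptions, not the benchmark's exact implementation.

```python
"""Minimal sketch of MT Bench-style LLM-as-judge scoring (illustrative only)."""
import re
from typing import Callable

# Judge prompt loosely modeled on single-answer grading; the exact MT Bench
# prompts differ -- this wording is an assumption for illustration.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's answer to the user's\n"
    "question on a scale of 1 to 10 for helpfulness, relevance, and accuracy.\n"
    "Reply with the rating in the form: Rating: [[X]]\n\n"
    "Question: {question}\n\nAssistant's answer: {answer}\n"
)

SCORE_RE = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def judge_answers(
    qa_pairs: list[tuple[str, str]],
    call_judge: Callable[[str], str],
) -> float:
    """Score each (question, answer) pair with the judge model and average."""
    scores = []
    for question, answer in qa_pairs:
        verdict = call_judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
        match = SCORE_RE.search(verdict)
        if match:  # skip unparseable verdicts rather than guessing a score
            scores.append(float(match.group(1)))
    return sum(scores) / len(scores) if scores else float("nan")


if __name__ == "__main__":
    # Stub judge for a self-contained demo; in practice `call_judge` would
    # wrap an API call to a strong judge model.
    def fake_judge(prompt: str) -> str:
        return "The answer is concise and correct. Rating: [[8]]"

    pairs = [("What is 2 + 2?", "2 + 2 equals 4.")]
    print(f"Mean judge score: {judge_answers(pairs, fake_judge):.1f}")
```

Keeping the judge backend behind a plain callable makes it easy to swap in any API client or local model without changing the scoring loop.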
Papers
March 18, 2024
March 7, 2024
February 22, 2024
February 20, 2024
February 17, 2024
February 13, 2024
January 9, 2024
December 22, 2023
December 11, 2023
November 27, 2023
October 23, 2023
September 25, 2023
September 6, 2023
August 12, 2023
June 9, 2023
June 4, 2023
June 3, 2023
April 20, 2023
April 11, 2023