Evaluation Benchmark
Evaluation benchmarks are crucial for assessing the performance of large language models (LLMs) and other AI systems across diverse tasks, providing objective measures of capabilities and identifying areas for improvement. Current research focuses on developing comprehensive benchmarks that address various challenges, including data contamination, bias, and the evaluation of specific model functionalities (e.g., tool use, image editing, and video analysis), often incorporating novel metrics and datasets. These benchmarks are vital for fostering reproducible research, enabling fair comparisons between models, and ultimately driving the development of more robust and reliable AI systems with real-world applications.
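The core of most benchmarks is a scoring loop: compare each model output against a gold reference with a fixed metric and aggregate. The sketch below is purely illustrative, assuming a hypothetical QA-style dataset of (question, gold answer) pairs and exact-match as the metric; it is not the evaluation protocol of any specific benchmark mentioned here.

```python
# Illustrative benchmark evaluation loop (hypothetical data, exact-match metric).

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count as errors."""
    return " ".join(text.lower().split())

def exact_match(prediction: str, gold: str) -> bool:
    """Score one prediction against the gold answer."""
    return normalize(prediction) == normalize(gold)

def evaluate(predictions: list[str], dataset: list[tuple[str, str]]) -> float:
    """Return exact-match accuracy over (question, gold_answer) pairs."""
    correct = sum(
        exact_match(pred, gold)
        for pred, (_, gold) in zip(predictions, dataset)
    )
    return correct / len(dataset)

if __name__ == "__main__":
    # Toy dataset; real benchmarks hold out test data to avoid contamination.
    dataset = [("2+2?", "4"), ("Capital of France?", "Paris")]
    predictions = ["4", "paris "]
    print(evaluate(predictions, dataset))  # 1.0
```

Keeping the metric fixed and the test set held out is what makes comparisons between models fair and reproducible; published harnesses differ mainly in how many tasks and metrics they bundle.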
Papers
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting
Zhe Li, Xiangfei Qiu, Peng Chen, Yihang Wang, Hanyin Cheng, Yang Shu, Jilin Hu, Chenjuan Guo, Aoying Zhou, Qingsong Wen, Christian S. Jensen, Bin Yang
Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs
Wanying Wang, Zeyu Ma, Pengfei Liu, Mingang Chen