New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues such as fairness and robustness, and quantify uncertainty in the performance of models built on various architectures, such as transformers and graph neural networks. The resulting standardized evaluations and datasets facilitate more rigorous model comparisons and help identify areas that need improvement, ultimately leading to more reliable and effective AI systems across numerous applications.
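As a hedged illustration of the "quantify uncertainty in model performance" point above (not taken from any of the papers listed below), the minimal Python sketch that follows computes a percentile-bootstrap confidence interval over per-example benchmark scores. The function name bootstrap_ci, the resample count, and the toy score list are illustrative assumptions.

import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    # Illustrative sketch (hypothetical helper, not from any listed paper):
    # percentile bootstrap confidence interval for mean benchmark accuracy.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample the per-example scores with replacement and record the mean.
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.mean(scores), (lo, hi)

# Example: toy per-question correctness (1 = correct, 0 = incorrect) from one model run.
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
mean, (low, high) = bootstrap_ci(scores)
print(f"accuracy = {mean:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")

Reporting an interval alongside the point estimate makes comparisons between models on the same benchmark more informative, since overlapping intervals suggest the observed gap may not be robust to the particular sample of test items.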
Papers
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
Xuwu Wang, Qiwen Cui, Yunzhe Tao, Yiran Wang, Ziwei Chai, Xiaotian Han, Boyi Liu, Jianbo Yuan, Jing Su, Guoyin Wang, Tingkai Liu, Liyu Chen, Tianyi Liu, Tao Sun, Yufeng Zhang, Sirui Zheng, Quanzeng You, Yang Yang, Hongxia Yang
RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations
Kaichen Zhou, Yang Cao, Taewhan Kim, Hao Zhao, Hao Dong, Kai Ming Ting, Ye Zhu
MCUBench: A Benchmark of Tiny Object Detectors on MCUs
Sudhakar Sah, Darshan C. Ganji, Matteo Grimaldi, Ravish Kumar, Alexander Hoffman, Honnesh Rohmetra, Ehsan Saboori
Off to new Shores: A Dataset & Benchmark for (near-)coastal Flood Inundation Forecasting
Brandon Victor, Mathilde Letard, Peter Naylor, Karim Douch, Nicolas Longépé, Zhen He, Patrick Ebel
DMC-VB: A Benchmark for Representation Learning for Control with Visual Distractors
Joseph Ortiz, Antoine Dedieu, Wolfgang Lehrach, Swaroop Guntupalli, Carter Wendelken, Ahmad Humayun, Guangyao Zhou, Sivaramakrishnan Swaminathan, Miguel Lázaro-Gredilla, Kevin Murphy
MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
Elliot L. Epstein, Kaisheng Yao, Jing Li, Xinyi Bai, Hamid Palangi
Tiny Robotics Dataset and Benchmark for Continual Object Detection
Francesco Pasti, Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto, Nicola Bellotto
CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data
Qianwen Zhang, Haochen Wang, Fang Li, Siyu An, Lingfeng Qiao, Liangcai Gao, Di Yin, Xing Sun
Hydrogen under Pressure as a Benchmark for Machine-Learning Interatomic Potentials
Thomas Bischoff, Bastian Jäckl, Matthias Rupp
EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models
Yuyan Chen, Hao Wang, Songzhou Yan, Sijia Liu, Yueze Li, Yi Zhao, Yanghua Xiao
Revisiting Synthetic Human Trajectories: Imitative Generation and Benchmarks Beyond Datasaurus
Bangchao Deng, Xin Jing, Tianyue Yang, Bingqing Qu, Philippe Cudre-Mauroux, Dingqi Yang
JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models
Junfeng Jiang, Jiahao Huang, Akiko Aizawa