New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues like fairness and robustness, and quantify uncertainty in model performance, using various architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets are crucial for advancing the field by facilitating more rigorous comparisons of models and identifying areas needing improvement, ultimately leading to more reliable and effective AI systems across numerous applications.
Papers
Can time series forecasting be automated? A benchmark and analysis
Anvitha Thirthapura Sreedhara, Joaquin Vanschoren
MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning
Florian Felten, Umut Ucak, Hicham Azmani, Gao Peng, Willem Röpke, Hendrik Baier, Patrick Mannion, Diederik M. Roijers, Jordan K. Terry, El-Ghazali Talbi, Grégoire Danoy, Ann Nowé, Roxana Rădulescu
AICircuit: A Multi-Level Dataset and Benchmark for AI-Driven Analog Integrated Circuit Design
Asal Mehradfar, Xuzhe Zhao, Yue Niu, Sara Babakniya, Mahdi Alesheikh, Hamidreza Aghasi, Salman Avestimehr
Benchmarks as Microscopes: A Call for Model Metrology
Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu, Dongxu Li, Bei Chen, Junnan Li
DStruct2Design: Data and Benchmarks for Data Structure Driven Generative Floor Plan Design
Zhi Hao Luo, Luis Lara, Ge Ya Luo, Florian Golemo, Christopher Beckham, Christopher Pal
Baba Is AI: Break the Rules to Beat the Benchmark
Nathan Cloos, Meagan Jens, Michelangelo Naim, Yen-Ling Kuo, Ignacio Cases, Andrei Barbu, Christopher J. Cueva
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen
PetFace: A Large-Scale Dataset and Benchmark for Animal Identification
Risa Shinoda, Kaede Shiohara
A Dataset and Benchmark for Shape Completion of Fruits for Agricultural Robotics
Federico Magistri, Thomas Läbe, Elias Marks, Sumanth Nagulavancha, Yue Pan, Claus Smitt, Lasse Klingbeil, Michael Halstead, Heiner Kuhlmann, Chris McCool, Jens Behley, Cyrill Stachniss
SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning
Joseph Marvin Imperial, Harish Tayyar Madabushi
Deep Time Series Models: A Comprehensive Survey and Benchmark
Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, Jianmin Wang
LSD3K: A Benchmark for Smoke Removal from Laparoscopic Surgery Images
Wenhui Chang, Hongming Chen
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects
Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, Xiaokang Yang
RBAD: A Dataset and Benchmark for Retinal Vessels Branching Angle Detection
Hao Wang, Wenhui Zhu, Jiayou Qin, Xin Li, Oana Dumitrascu, Xiwen Chen, Peijie Qiu, Abolfazl Razi
M18K: A Comprehensive RGB-D Dataset and Benchmark for Mushroom Detection and Instance Segmentation
Abdollah Zakeri, Mulham Fawakherji, Jiming Kang, Bikram Koirala, Venkatesh Balan, Weihang Zhu, Driss Benhaddou, Fatima A. Merchant
DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems
Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu