Benchmark Suite

A benchmark suite is a collection of standardized datasets and evaluation protocols used to assess machine learning models across diverse tasks. Current research develops suites for domains including video understanding, log analysis, compiler autotuning, and natural language processing, often to evaluate large language models and other deep learning architectures. Such suites support reproducible research and fair comparison of models and algorithms, and they help drive the development of AI systems that are more robust and generalize better to real-world applications.

Papers

November 16, 2023

GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, Ansgar Scherp
Code Generation Benchmark Suite Code Search Program Comprehension CodeSearchNet Corpus

November 15, 2023

Improved Sparse Ising Optimization
Kenneth M. Zick
Many Sparse Benchmark Suite Sparse Optimization Deep Boltzmann

October 6, 2023

Routing Arena: A Benchmark Suite for Neural Routing Solvers
Daniela Thyssens, Tim Dernedde, Jonas K. Falkner, Lars Schmidt-Thieme
Neural Solver Benchmark Suite Neural Combinatorial Optimization Integer Linear Programming Solver Routing Game

September 29, 2023

Optimizing with Low Budgets: a Comparison on the Black-box Optimization Benchmarking Suite and OpenAI Gym
Elena Raponi, Nathanael Rakotonirina Carraz, Jérémy Rapin, Carola Doerr, Olivier Teytaud
Bayesian Optimization Consistent Comparison Black Box Optimization Benchmark Suite OpenAI Gym Low Budget

September 15, 2023

Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite
Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, Da-shan Shiu
Large Language Model Language Model Global Evaluation Language Understanding English Dataset Benchmark Suite Chinese Language Model

July 20, 2023

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang
Reasoning Capability Reasoning Ability Benchmark Suite

May 24, 2023

Analysis of modular CMA-ES on strict box-constrained problems in the SBOX-COST benchmarking suite
Diederick Vermetten, Manuel López-Ibáñez, Olaf Mersmann, Richard Allmendinger, Anna V. Kononova
General Analysis Benchmark Suite CMA-ES Real World Optimization Problem Constraint Solving Box Constraint

May 20, 2023

Patterns of Convergence and Bound Constraint Violation in Differential Evolution on SBOX-COST Benchmarking Suite
Mădălina-Andreea Mitran, Anna V. Kononova, Fabio Caraffini, Daniela Zaharie
Early Stage Convergence Participation Constraint Complex Pattern Differential Evolution Benchmark Suite Constraint Violation Adaptive Strategy Constraint Solving Marginal Constraint

January 11, 2023

SynMotor: A Benchmark Suite for Object Attribute Regression and Multi-task Learning
Chengzhi Wu, Linxi Qiu, Kanran Zhou, Julius Pfrommer, Jürgen Beyerer
Point Cloud Multi Task Learning Synthetic Image Computer Vision Task Benchmark Suite Segmentation Annotation Object Attribute

December 20, 2022

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks
Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe
Spoken Language Understanding Benchmark Suite Speech Task

November 3, 2022

LMentry: A Language Model Benchmark of Elementary Language Tasks
Avia Efrat, Or Honovich, Omer Levy
Large Language Model Language Model OpenAI Codex Instruction Tuned Model Language Task Benchmark Suite

October 13, 2022

Forces are not Enough: Benchmark and Critical Evaluation for Machine Learning Force Fields with Molecular Simulations
Xiang Fu, Zhenghao Wu, Wujie Wang, Tian Xie, Sinan Keten, Rafael Gomez-Bombarelli, Tommi Jaakkola
New Benchmark Critical Review Benchmark Suite External Human Force Force Estimation Molecular Simulation Ab Initio Machine Learning Force Field

July 20, 2022

The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing
Dawit Mureja Argaw, Fabian Caba Heilbron, Joon-Young Lee, Markus Woodson, In So Kweon
Data Set Video Editing Video Analysis Benchmark Suite Video Manipulation Video Editing Task

June 12, 2022

Arena-Bench: A Benchmarking Suite for Obstacle Avoidance Approaches in Highly Dynamic Environments
Linh Kästner, Teham Bhuiyan, Tuan Anh Le, Elias Treis, Johannes Cox, Boris Meinardus, Jacek Kmiecik, Reyk Carstens, Duc Pichel, Bassel Fatloun, Niloufar Khorsandi, Jens Lambrecht
Mobile Robot Autonomous Navigation Obstacle Avoidance Learning Based Dynamic Environment Benchmark Suite Arena Hard

June 8, 2022

FedHPO-B: A Benchmark Suite for Federated Hyperparameter Optimization
Zhen Wang, Weirui Kuang, Ce Zhang, Bolin Ding, Yaliang Li
Hyperparameter Optimization Benchmark Suite Federated Hyperparameter

April 25, 2022

SELECTOR: Selecting a Representative Benchmark Suite for Reproducible Statistical Comparison
Gjorgjina Cenikj, Ryan Dieter Lang, Andries Petrus Engelbrecht, Carola Doerr, Peter Korošec, Tome Eftimov
Benchmark Datasets Optimization Algorithm Benchmark Suite Robust Algorithm Fair Algorithm Reproducible Evaluation Optimization Benchmark Network Selection

November 16, 2021

DataCLUE: A Benchmark Suite for Data-centric NLP
Liang Xu, Jiacheng Liu, Xiang Pan, Xiaojing Lu, Xiaofeng Hou
NLP Field Data Centric Benchmark Suite
