Challenging Benchmark

Challenging benchmarks are crucial for evaluating the capabilities of large language models (LLMs) and other AI systems, pushing the boundaries of their performance beyond easily solvable tasks. Current research focuses on creating benchmarks that assess diverse skills, including cultural knowledge, mathematical reasoning, multimodal understanding, and complex reasoning in various domains like code generation and scientific claim verification, often employing techniques like chain-of-thought prompting. These efforts are vital for identifying and addressing limitations in current AI systems, ultimately leading to more robust and reliable models with broader applicability across diverse real-world scenarios. The development of these benchmarks is driving innovation in both model architecture and evaluation methodologies.

Papers

May 22, 2023

SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables
Xinyuan Lu, Liangming Pan, Qian Liu, Preslav Nakov, Min-Yen Kan
Compositional Reasoning Challenging Benchmark Evidence Based Scientific Claim Fact Checking Benchmark

April 22, 2023

OmniLabel: A Challenging Benchmark for Language-Based Object Detection
Samuel Schulter, Vijay Kumar B G, Yumin Suh, Konstantinos M. Dafnis, Zhixing Zhang, Shiyu Zhao, Dimitris Metaxas
Object Detection Arbitrary Object Challenging Benchmark Open Vocabulary Detection Language Based

March 7, 2023

A Challenging Benchmark for Low-Resource Learning
Yudong Wang, Chang Ma, Qingxiu Dong, Lingpeng Kong, Jingjing Xu
Low Resource Pre Trained Network Challenging Benchmark Robustness Gap

December 21, 2022

ORCA: A Challenging Benchmark for Arabic Language Understanding
AbdelRahim Elmadany, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
Language Model Multilingual Language Model Challenging Benchmark ORCa Behavior

December 15, 2022

Neuroevolution of Physics-Informed Neural Nets: Benchmark Problems and Comparative Results
Nicholas Sung Wei Yong, Jian Cheng Wong, Pao-Hsiung Chiu, Abhishek Gupta, Chinchun Ooi, Yew-Soon Ong
Gradient Descent Physic Informed Neural Network Challenging Benchmark Experimental Comparison NeuroEvolution Algorithm

December 14, 2022

SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning
Benjamin Ellis, Jonathan Cook, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob N. Foerster, Shimon Whiteson
Cooperative Multi Agent Reinforcement Learning Multi Agent Challenge Challenging Benchmark Improved Benchmark

November 1, 2022

ClassActionPrediction: A Challenging Benchmark for Legal Judgment Prediction of Class Action Cases in the US
Gil Semo, Dor Bernsohn, Ben Hagag, Gila Hayat, Joel Niklaus
Natural Language Processing Action Prediction Legal Judgment Prediction United State Challenging Benchmark Legal Natural Language Processing

July 20, 2022

The Game of Hidden Rules: A New Kind of Benchmark Challenge for Machine Learning
Eric Pulick, Shubham Bharti, Yiding Chen, Vladimir Menkov, Yonatan Mintz, Paul Kantor, Vicki M. Bier
Machine Learning Game Content Hidden Knowledge Challenging Benchmark Benchmark Environment

June 6, 2022

WHU-Stereo: A Challenging Benchmark for Stereo Matching of High-Resolution Satellite Images
Shenhong Li, Sheng He, San Jiang, Wanshou Jiang, Lin Zhang
Stereo Matching High Resolution Satellite Stereo Image Disparity Map Stereo Dataset Challenging Benchmark

April 29, 2022

A Challenging Benchmark of Anime Style Recognition
Haotang Li, Shengtao Guo, Kailin Lyu, Xiao Yang, Tianchen Chen, Jianqing Zhu, Huanqiang Zeng
Challenging Benchmark Anime Style Recognition

April 10, 2022

CholecTriplet2021: A benchmark challenge for surgical action triplet recognition
Chinedu Innocent Nwoye, Deepak Alapatt, Tong Yu, Armine Vardazaryan, Fangfang Xia, Zixuan Zhao, Tong Xia, Fucang Jia, Yuxuan Yang, Hao Wang, Derong Yu, Guoyan Zheng, Xiaotian Duan, Neil Getty, Ricardo Sanchez-Matilla, Maria Robu, Li Zhang, Huabin Chen, Jiacheng Wang, Liansheng Wang, Bokai Zhang, Beerend Gerats, Sista Raviteja, Rachana Sathish, Rong Tao, Satoshi Kondo, Winnie Pang, Hongliang Ren, Julian Ronald Abbing, Mohammad Hasan Sarhan, Sebastian Bodenstedt, Nithya Bhasker, Bruno Oliveira, Helena R. Torres, Li Ling, Finn Gaida, Tobias Czempiel, João L. Vilaça, Pedro Morais, Jaime Fonseca, Ruby Mae Egging, Inge Nicole Wijma, Chen Qian, Guibin Bian, Zhen Li, Velmurugan Balasubramanian, Debdoot Sheet, Imanol Luengo, Yuanbo Zhu, Shuai Ding, Jakob-Anton Aschenbrenner, Nicolas Elini van der Kar, Mengya Xu, Mobarakol Islam, Lalithkumar Seenivasan, Alexander Jenke, Danail Stoyanov, Didier Mutter, Pietro Mascagni, Barbara Seeliger, Cristians Gonzalez, Nicolas Padoy
Surgical Video Challenging Benchmark Surgical Workflow Analysis Surgical Action Surgical Action Triplet

Challenging Benchmark

Papers

SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables

OmniLabel: A Challenging Benchmark for Language-Based Object Detection

A Challenging Benchmark for Low-Resource Learning

ORCA: A Challenging Benchmark for Arabic Language Understanding

Neuroevolution of Physics-Informed Neural Nets: Benchmark Problems and Comparative Results

SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning

ClassActionPrediction: A Challenging Benchmark for Legal Judgment Prediction of Class Action Cases in the US

The Game of Hidden Rules: A New Kind of Benchmark Challenge for Machine Learning

WHU-Stereo: A Challenging Benchmark for Stereo Matching of High-Resolution Satellite Images

A Challenging Benchmark of Anime Style Recognition

CholecTriplet2021: A benchmark challenge for surgical action triplet recognition