LLM Benchmark
LLM benchmarking aims to objectively evaluate the capabilities of large language models across diverse tasks, addressing limitations of existing approaches such as reliance on static datasets and potential biases in human or LLM-based judging. Current research focuses on developing more robust and dynamic benchmarks, including those built on real-world interactions, game-based competitions, and knowledge-grounded evaluations, often incorporating techniques such as prompt engineering and multi-agent coordination. These efforts are crucial for fostering the responsible development and deployment of LLMs, improving model transparency, and guiding future research directions in AI.
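One way to make the idea of benchmark agreement (the question studied in the BenchBench paper below) concrete is a simple rank-correlation check between two benchmarks' model rankings. The sketch below is illustrative only: the model names and scores are hypothetical, and it uses SciPy's Kendall's tau rather than the BenchBench library itself.

```python
# Minimal sketch of benchmark agreement testing: given two benchmarks'
# scores for the same set of models, measure how well their induced
# rankings agree using Kendall's tau. Model names and scores are
# illustrative placeholders, not results from any real leaderboard.
from scipy.stats import kendalltau

# Hypothetical per-model scores on two different benchmarks.
benchmark_a = {"model-1": 71.2, "model-2": 65.4, "model-3": 80.1, "model-4": 58.9}
benchmark_b = {"model-1": 0.62, "model-2": 0.66, "model-3": 0.79, "model-4": 0.51}

# Align scores by model so both lists follow the same order.
models = sorted(benchmark_a)
scores_a = [benchmark_a[m] for m in models]
scores_b = [benchmark_b[m] for m in models]

# Tau near 1.0 means the benchmarks rank models similarly; values near 0
# suggest they measure different (or noisy) aspects of capability.
tau, p_value = kendalltau(scores_a, scores_b)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```

Low agreement between benchmarks is one signal that a single leaderboard may not generalize, which is the kind of issue dynamic and knowledge-grounded benchmarks try to address.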
Papers
Werewolf Arena: A Case Study in LLM Evaluation via Social Deduction
Suma Bailis, Jane Friedhoff, Feiyang Chen
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen