New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including LLM-based judging, multimodal visual grounding, long-context reasoning, time series forecasting, and agent safety. These benchmarks aim to standardize evaluation methodologies, address issues such as fairness, bias, and robustness, and quantify uncertainty in model performance. The resulting standardized evaluations and datasets are crucial for advancing the field: they enable more rigorous comparisons between models, expose areas needing improvement, and ultimately support more reliable and effective AI systems across a wide range of applications.
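To make the "standardized evaluation plus uncertainty quantification" point concrete, below is a minimal sketch of a generic benchmark evaluation loop. It is not drawn from any of the listed papers; the model interface (a prompt-to-answer callable), exact-match scoring, and the bootstrap confidence interval are illustrative assumptions only.

```python
# Illustrative sketch (not from any listed paper): a minimal standardized
# evaluation loop that scores a model on a benchmark and quantifies
# uncertainty in the accuracy estimate with a bootstrap confidence interval.
import random
from typing import Callable, Sequence, Tuple


def evaluate_with_uncertainty(
    model: Callable[[str], str],          # hypothetical interface: prompt -> answer
    examples: Sequence[Tuple[str, str]],  # (prompt, reference answer) pairs
    n_bootstrap: int = 1000,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float]]:
    """Return exact-match accuracy and a 95% bootstrap confidence interval."""
    # Score every example once: 1 if the model's answer matches the reference.
    scores = [1.0 if model(prompt).strip() == ref.strip() else 0.0
              for prompt, ref in examples]
    accuracy = sum(scores) / len(scores)

    # Bootstrap: resample the per-example scores to estimate the spread of
    # the accuracy metric instead of reporting a single point value.
    rng = random.Random(seed)
    resampled = []
    for _ in range(n_bootstrap):
        sample = [scores[rng.randrange(len(scores))] for _ in scores]
        resampled.append(sum(sample) / len(sample))
    resampled.sort()
    ci_low = resampled[int(0.025 * n_bootstrap)]
    ci_high = resampled[int(0.975 * n_bootstrap)]
    return accuracy, (ci_low, ci_high)


if __name__ == "__main__":
    # Toy benchmark and a trivial lookup "model", purely for demonstration.
    benchmark = [("2+2=", "4"), ("Capital of France?", "Paris"), ("3*3=", "9")]
    toy_model = lambda p: {"2+2=": "4", "Capital of France?": "Paris"}.get(p, "?")
    acc, (lo, hi) = evaluate_with_uncertainty(toy_model, benchmark)
    print(f"accuracy={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting an interval rather than a single accuracy number is one simple way such benchmarks can make cross-model comparisons more rigorous, since apparent gaps smaller than the interval width are not meaningful.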
Papers
BenchmarkCards: Large Language Model and Risk Reporting
Anna Sokol, Nuno Moniz, Elizabeth Daly, Michael Hind, Nitesh Chawla
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection
Qishun Wang, Zhengzheng Tu, Kunpeng Wang, Le Gu, Chuanwang Guo
Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data
Seiji Maekawa, Hayate Iso, Nikita Bhutani
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
Lorenzo Pacchiardi, Marko Tesic, Lucy G. Cheke, José Hernández-Orallo
Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs
Wanying Wang, Zeyu Ma, Pengfei Liu, Mingang Chen
Experiences from Creating a Benchmark for Sentiment Classification for Varieties of English
Dipankar Srirag, Jordan Painter, Aditya Joshi, Diptesh Kanojia
Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks
Nathaniel Demchak, Xin Guan, Zekun Wu, Ziyi Xu, Adriano Koshiyama, Emre Kazim
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Fangru Lin, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Jing Yao, Si-Qing Chen, Michael Wooldridge, Furu Wei
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Choshen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes
GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation
Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, Doyen Sahoo
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
A Benchmark for Cross-Domain Argumentative Stance Classification on Social Media
Jiaqing Yuan, Ruijie Xi, Munindar P. Singh
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment
Claas A Voelcker, Marcel Hussing, Eric Eaton
A Comparative Analysis on Ethical Benchmarking in Large Language Models
Kira Sam, Raja Vavekanand