New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues like fairness and robustness, and quantify uncertainty in model performance, using various architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets are crucial for advancing the field by facilitating more rigorous comparisons of models and identifying areas needing improvement, ultimately leading to more reliable and effective AI systems across numerous applications.
Papers
DafnyBench: A Benchmark for Formal Software Verification
Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen
Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark
Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen
A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection
Can Akbas, Irem Su Arin, Sinan Onal
Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
Aidar Myrzakhan, Sondos Mahmoud Bsharat, Zhiqiang Shen
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, Nidhi Rastogi
Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval
Adrià Molina, Oriol Ramos Terrades, Josep Lladós
DCA-Bench: A Benchmark for Dataset Curation Agents
Benhao Huang, Yingzhuo Yu, Jin Huang, Xingjian Zhang, Jiaqi Ma
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain
EEG-ImageNet: An Electroencephalogram Dataset and Benchmarks with Image Visual Stimuli of Multi-Granularity Labels
Shuqi Zhu, Ziyi Ye, Qingyao Ai, Yiqun Liu
Demonstrating HumanTHOR: A Simulation Platform and Benchmark for Human-Robot Collaboration in a Shared Workspace
Chenxu Wang, Boyuan Du, Jiaxin Xu, Peiyan Li, Di Guo, Huaping Liu
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, Hannah Rose Kirk
Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench
Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-chi Cheung, Chang Xu
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
Weiping Fu, Bifan Wei, Jianxiang Hu, Zhongmin Cai, Jun Liu
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding
Yiqing Shen, Zan Chen, Michail Mamalakis, Luhan He, Haiyang Xia, Tianbin Li, Yanzhou Su, Junjun He, Yu Guang Wang
Towards a Benchmark for Causal Business Process Reasoning with LLMs
Fabiana Fournier, Lior Limonad, Inna Skarbovsky