New Benchmarks
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, domain-specific question answering (e.g., disaster response and hospital settings), graph representation learning, video generation and captioning, and robotic manipulation. These benchmarks aim to standardize evaluation methodology, probe safety, fairness, and robustness, and quantify the uncertainty in reported scores, covering architectures from transformers to graph neural networks. The resulting standardized evaluations and datasets enable more rigorous model comparisons and expose areas needing improvement, supporting more reliable and effective AI systems across applications.
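One recurring theme in this batch is quantifying uncertainty in benchmark scores (see, e.g., the reproducibility paper by Blackwell et al. below). As a minimal illustrative sketch only, not any listed paper's method: a percentile bootstrap over per-item correctness yields a confidence interval around a reported accuracy, so that single-number leaderboard scores can be compared with their sampling variability in view. The `bootstrap_ci` function and the `scores` data are hypothetical.

```python
import random

def bootstrap_ci(per_item_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean benchmark score."""
    rng = random.Random(seed)
    n = len(per_item_scores)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement and record the resampled mean.
        sample = [per_item_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(per_item_scores) / n, (low, high)

# Hypothetical per-question correctness (1 = correct) from one benchmark run.
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
mean, (low, high) = bootstrap_ci(scores)
print(f"accuracy = {mean:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only 20 items, the interval is wide, which is exactly the point: small evaluation sets can make apparent ranking differences between models statistically indistinguishable.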
Papers
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov
A Benchmark on Directed Graph Representation Learning in Hardware Designs
Haoyu Wang, Yinan Huang, Nan Wu, Pan Li
DisasterQA: A Benchmark for Assessing the Performance of LLMs in Disaster Response
Rajat Rawat
ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments
Sourjyadip Ray, Kushal Gupta, Soumi Kundu, Payal Arvind Kasat, Somak Aditya, Pawan Goyal
FedGraph: A Research Library and Benchmark for Federated Graph Learning
Yuhang Yao, Yuan Li, Xinyi Fan, Junhao Li, Kay Liu, Weizhao Jin, Srivatsan Ravi, Philip S. Yu, Carlee Joe-Wong
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, Ping Luo
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz
Unveiling the Impact of Local Homophily on GNN Fairness: In-Depth Analysis and New Benchmarks
Donald Loveland, Danai Koutra
PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms
Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Pan Hu, Yijing Zeng, Jayaram Raghuram, Suman Banerjee
Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores
Robert E. Blackwell, Jon Barry, Anthony G. Cohn
Towards a Benchmark for Large Language Models for Business Process Management Tasks
Kiran Busch, Henrik Leopold
How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension
Xinnan Dai, Haohao Qu, Yifen Shen, Bohang Zhang, Qihao Wen, Wenqi Fan, Dongsheng Li, Jiliang Tang, Caihua Shan
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
Ippei Fujisawa, Sensho Nobe, Hiroki Seto, Rina Onda, Yoshiaki Uchida, Hiroki Ikoma, Pei-Chun Chien, Ryota Kanai
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, Christopher D. Manning
DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning
Jiaqing Xie, Yue Zhao, Tianfan Fu
MARPLE: A Benchmark for Long-Horizon Inference
Emily Jin, Zhuoyi Huang, Jan-Philipp Fränken, Weiyu Liu, Hannah Cha, Erik Brockbank, Sarah Wu, Ruohan Zhang, Jiajun Wu, Tobias Gerstenberg
LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models
Zhenyue Qin, Yu Yin, Dylan Campbell, Xuansheng Wu, Ke Zou, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen
CrowdCounter: A benchmark type-specific multi-target counterspeech dataset
Punyajoy Saha, Abhilash Datta, Abhik Jana, Animesh Mukherjee
Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
Ricardo Garcia, Shizhe Chen, Cordelia Schmid