New Benchmark
Recent research develops comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodology, address issues such as fairness and robustness, and quantify uncertainty in model performance; the models evaluated span architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets support more rigorous model comparisons and help identify areas needing improvement, with the goal of more reliable and effective AI systems.
Papers
Are Heterophily-Specific GNNs and Homophily Metrics Really Effective? Evaluation Pitfalls and New Benchmarks
Sitao Luan, Qincheng Lu, Chenqing Hua, Xinyu Wang, Jiaqi Zhu, Xiao-Wen Chang, Guy Wolf, Jian Tang
A System and Benchmark for LLM-based Q&A on Heterogeneous Data
Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath III, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman
AnomalyCD: A benchmark for Earth anomaly change detection with high-resolution and time-series observations
Jingtao Li, Qian Zhu, Xinyu Wang, Hengwei Zhao, Yanfei Zhong
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
Kinjal Basu, Ibrahim Abdelaziz, Kelsey Bradford, Maxwell Crouse, Kiran Kate, Sadhana Kumaravel, Saurabh Goyal, Asim Munawar, Yara Rizk, Xin Wang, Luis Lastras, Pavan Kapanipathi
Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving
Yuhang Lu, Yichen Yao, Jiadong Tu, Jiangnan Shao, Yuexin Ma, Xinge Zhu
TLD: A Vehicle Tail Light signal Dataset and Benchmark
Jinhao Chai, Shiyi Mu, Shugong Xu
H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
Solim LeGris, Wai Keen Vong, Brenden M. Lake, Todd M. Gureckis
Adversarial Pruning: A Survey and Benchmark of Pruning Methods for Adversarial Robustness
Giorgio Piras, Maura Pintor, Ambra Demontis, Battista Biggio, Giorgio Giacinto, Fabio Roli
Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks
Hongjun Wang, Sagar Vaze, Kai Han
Mismatched: Evaluating the Limits of Image Matching Approaches and Benchmarks
Sierra Bonilla, Chiara Di Vece, Rema Daher, Xinwei Ju, Danail Stoyanov, Francisco Vasconcelos, Sophia Bano
CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases
Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan
BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang
RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments
Haisheng Su, Feixiang Song, Cong Ma, Wei Wu, Junchi Yan
CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models
Shubham Bharti, Shiyun Cheng, Jihyun Rho, Martina Rao, Xiaojin Zhu
CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper Influence
Chaochao Chen, Jiaming Zhang, Yizhao Zhang, Li Zhang, Lingjuan Lyu, Yuyuan Li, Biao Gong, Chenggang Yan