New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues such as fairness and robustness, and quantify uncertainty in model performance across architectures ranging from transformers to graph neural networks. The resulting standardized evaluations and datasets facilitate more rigorous model comparisons and highlight areas needing improvement, ultimately leading to more reliable and effective AI systems across numerous applications.
Papers
Improving the Robustness of 3D Human Pose Estimation: A Benchmark and Learning from Noisy Input
Trung-Hieu Hoang, Mona Zehni, Huy Phan, Duc Minh Vo, Minh N. Do
A Multimodal Dataset and Benchmark for Radio Galaxy and Infrared Host Detection
Nikhel Gupta, Zeeshan Hayder, Ray P. Norris, Minh Huynh, Lars Petersson
AI Competitions and Benchmarks: The life cycle of challenges and benchmarks
Gustavo Stolovitzky, Julio Saez-Rodriguez, Julie Bletz, Jacob Albrecht, Gaia Andreoletti, James C. Costello, Paul Boutros
AI Competitions and Benchmarks: Competition platforms
Andrey Ustyuzhanin, Harald Carlens
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding
Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, Yu Qiao
Beyond Accuracy: Statistical Measures and Benchmark for Evaluation of Representation from Self-Supervised Learning
Jiantao Wu, Shentong Mo, Sara Atito, Josef Kittler, Zhenhua Feng, Muhammad Awais
SCTc-TE: A Comprehensive Formulation and Benchmark for Temporal Event Forecasting
Yunshan Ma, Chenchen Ye, Zijian Wu, Xiang Wang, Yixin Cao, Liang Pang, Tat-Seng Chua
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding
Rohan Myer Krishnan, Zitian Tang, Zhiqiu Yu, Chen Sun
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models
Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, Sergey Levine
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications
Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel
SAIBench: A Structural Interpretation of AI for Science Through Benchmarks
Yatao Li, Jianfeng Zhan
Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications
Junyi Ma, Xieyuanli Chen, Jiawei Huang, Jingyi Xu, Zhen Luo, Jintao Xu, Weihao Gu, Rui Ai, Hesheng Wang
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao
360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries
Huajian Huang, Changkun Liu, Yipeng Zhu, Hui Cheng, Tristan Braud, Sai-Kit Yeung
A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography
Noëmi Aepli, Chantal Amrhein, Florian Schottmann, Rico Sennrich
CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs
Haocheng Yuan, Jing Xu, Hao Pan, Adrien Bousseau, Niloy J. Mitra, Changjian Li