New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodology, address issues such as fairness and robustness, and quantify uncertainty in model performance for architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets enable more rigorous comparisons between models and expose areas that need improvement, which in turn supports more reliable and effective AI systems.
Papers
GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts
Deyu Zou, Shikun Liu, Siqi Miao, Victor Fung, Shiyu Chang, Pan Li
GraphextQA: A Benchmark for Evaluating Graph-Enhanced Large Language Models
Yuanchun Shen, Ruotong Liao, Zhen Han, Yunpu Ma, Volker Tresp
AutoVP: An Automated Visual Prompting Framework and Benchmark
Hsi-Ai Tsao, Lei Hsiung, Pin-Yu Chen, Sijia Liu, Tsung-Yi Ho
A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection
Shiping Yang, Renliang Sun, Xiaojun Wan
Skeleton Ground Truth Extraction: Methodology, Annotation Tool and Benchmarks
Cong Yang, Bipin Indurkhya, John See, Bo Gao, Yan Ke, Zeyd Boukhers, Zhenyu Yang, Marcin Grzegorzek
LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models
Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg
Human-centric Behavior Description in Videos: New Benchmark and Model
Lingru Zhou, Yiqi Gao, Manqing Zhang, Peng Wu, Peng Wang, Yanning Zhang
SCB-Dataset3: A Benchmark for Detecting Student Classroom Behavior
Fan Yang, Tao Wang
RSRD: A Road Surface Reconstruction Dataset and Benchmark for Safe and Comfortable Autonomous Driving
Tong Zhao, Chenfeng Xu, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan, Yintao Wei
CoralVOS: Dataset and Benchmark for Coral Video Segmentation
Zheng Ziqiang, Xie Yaofeng, Liang Haixin, Yu Zhibin, Sai-Kit Yeung
Beyond the Benchmark: Detecting Diverse Anomalies in Videos
Yoav Arad, Michael Werman
Mini-BEHAVIOR: A Procedurally Generated Benchmark for Long-horizon Decision-Making in Embodied AI
Emily Jin, Jiaheng Hu, Zhuoyi Huang, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Roberto Martín-Martín
SmartPlay: A Benchmark for LLMs as Intelligent Agents
Yue Wu, Xuan Tang, Tom M. Mitchell, Yuanzhi Li
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models
Man Luo, Shrinidhi Kumbhar, Ming Shen, Mihir Parmar, Neeraj Varshney, Pratyay Banerjee, Somak Aditya, Chitta Baral