New Benchmarks
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues such as fairness and robustness, and quantify uncertainty in model performance across architectures ranging from transformers to graph neural networks. The resulting standardized evaluations and datasets are crucial for advancing the field: they enable more rigorous model comparisons, expose areas needing improvement, and ultimately lead to more reliable and effective AI systems across numerous applications.
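As a concrete illustration of one of these themes, quantifying uncertainty in model performance, here is a minimal sketch of a percentile bootstrap over per-item benchmark correctness. The function name and the data are illustrative assumptions, not drawn from any of the papers listed below.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean benchmark accuracy.

    `scores` is a list of per-item correctness values (1 = correct, 0 = wrong).
    This is a hypothetical helper for illustration, not an API from any
    listed benchmark.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the test items with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical per-item results from a single benchmark run.
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
mean, (lo, hi) = bootstrap_ci(scores)
print(f"accuracy = {mean:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Reporting an interval rather than a single accuracy figure makes comparisons between models on the same benchmark more honest, since small test sets can produce large sampling variance.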
Papers
Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark
Zhengfei Kuang, Yunzhi Zhang, Hong-Xing Yu, Samir Agarwala, Shangzhe Wu, Jiajun Wu
FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions
Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, Maarten Sap
Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models
Gonzalo Martínez, Javier Conde, Elena Merino-Gómez, Beatriz Bermúdez-Margaretto, José Alberto Hernández, Pedro Reviriego, Marc Brysbaert
CITB: A Benchmark for Continual Instruction Tuning
Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad
InstructExcel: A Benchmark for Natural Language Instruction in Excel
Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
Chunlei Wang, Wenquan Feng, Xiangtai Li, Guangliang Cheng, Shuchang Lyu, Binghao Liu, Lijiang Chen, Qi Zhao
Is ChatGPT a game changer for geocoding -- a benchmark for geocoding address parsing techniques
Zhengcong Yin, Diya Li, Daniel W. Goldberg
DiFair: A Benchmark for Disentangled Assessment of Gender Knowledge and Bias
Mahdi Zakizadeh, Kaveh Eskandari Miandoab, Mohammad Taher Pilehvar
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation
Shengqiang Zhang, Philipp Wicke, Lütfi Kerem Şenel, Luis Figueredo, Abdeldjallil Naceri, Sami Haddadin, Barbara Plank, Hinrich Schütze
A Benchmark for Semi-Inductive Link Prediction in Knowledge Graphs
Adrian Kochsiek, Rainer Gemulla
VFLAIR: A Research Library and Benchmark for Vertical Federated Learning
Tianyuan Zou, Zixuan Gu, Yu He, Hideaki Takahashi, Yang Liu, Ya-Qin Zhang
Improving Access to Justice for the Indian Population: A Benchmark for Evaluating Translation of Legal Text to Indian Languages
Sayan Mahapatra, Debtanu Datta, Shubham Soni, Adrijit Goswami, Saptarshi Ghosh
New Benchmarks for Asian Facial Recognition Tasks: Face Classification with Large Foundation Models
Jinwoo Seo, Soora Choi, Eungyeom Ha, Beomjune Kim, Dongbin Na