New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodology, address issues such as fairness and robustness, and quantify uncertainty in model performance, covering architectures ranging from transformers to graph neural networks. The resulting standardized evaluations and datasets advance the field by enabling more rigorous model comparisons and exposing areas that need improvement, ultimately leading to more reliable and effective AI systems across a wide range of applications.
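As a rough sketch of what such a standardized evaluation harness typically looks like, the Python below runs a model over a fixed suite of benchmark tasks and attaches a bootstrap confidence interval to each per-task accuracy, illustrating the kind of uncertainty quantification mentioned above. All names here (`evaluate`, `bootstrap_ci`, the task dictionary format) are hypothetical and do not come from any of the papers listed.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples) - 1])

def evaluate(model, tasks):
    """Score a model on each task; report mean accuracy with uncertainty."""
    results = {}
    for task in tasks:
        scores = [
            1.0 if model(ex["input"]) == ex["target"] else 0.0
            for ex in task["examples"]
        ]
        results[task["name"]] = {
            "mean": statistics.mean(scores),
            "ci95": bootstrap_ci(scores),
        }
    return results

if __name__ == "__main__":
    # Toy stand-in for a real model, purely for illustration.
    toy_model = str.upper
    tasks = [{
        "name": "uppercase",
        "examples": [{"input": "ab", "target": "AB"},
                     {"input": "cd", "target": "CD"},
                     {"input": "ef", "target": "EF"}],
    }]
    print(evaluate(toy_model, tasks))
```

Real benchmarks differ mainly in scale and in task-specific metrics, but the pattern of a fixed task suite, a shared scoring function, and an uncertainty estimate is what makes cross-model comparisons rigorous.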
Papers
A Framework and Benchmark for Deep Batch Active Learning for Regression
David Holzmüller, Viktor Zaverkin, Johannes Kästner, Ingo Steinwart
BrainGB: A Benchmark for Brain Network Analysis with Graph Neural Networks
Hejie Cui, Wei Dai, Yanqiao Zhu, Xuan Kan, Antonio Aodong Chen Gu, Joshua Lukemire, Liang Zhan, Lifang He, Ying Guo, Carl Yang
Towards True Detail Restoration for Super-Resolution: A Benchmark and a Quality Metric
Eugene Lyapustin, Anastasia Kirillova, Viacheslav Meshchaninov, Evgeney Zimin, Nikolai Karetin, Dmitriy Vatolin
E-KAR: A Benchmark for Rationalizing Natural Language Analogical Reasoning
Jiangjie Chen, Rui Xu, Ziquan Fu, Wei Shi, Zhongqiao Li, Xinbo Zhang, Changzhi Sun, Lei Li, Yanghua Xiao, Hao Zhou
MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
Zhiruo Wang, Grace Cuenca, Shuyan Zhou, Frank F. Xu, Graham Neubig
RB2: Robotic Manipulation Benchmarking with a Twist
Sudeep Dasari, Jianren Wang, Joyce Hong, Shikhar Bahl, Yixin Lin, Austin Wang, Abitha Thankaraj, Karanbir Chahal, Berk Calli, Saurabh Gupta, David Held, Lerrel Pinto, Deepak Pathak, Vikash Kumar, Abhinav Gupta
Evaluating the Text-to-SQL Capabilities of Large Language Models
Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau
HEAR: Holistic Evaluation of Audio Representations
Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk
On the importance of stationarity, strong baselines and benchmarks in transport prediction problems
Filipe Rodrigues
A Data-scalable Transformer for Medical Image Segmentation: Architecture, Model Efficiency, and Benchmark
Yunhe Gao, Mu Zhou, Di Liu, Zhennan Yan, Shaoting Zhang, Dimitris N. Metaxas
KMIR: A Benchmark for Evaluating Knowledge Memorization, Identification and Reasoning Abilities of Language Models
Daniel Gao, Yantao Jia, Lei Li, Chengzhen Fu, Zhicheng Dou, Hao Jiang, Xinyu Zhang, Lei Chen, Zhao Cao