New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks, which span architectures ranging from transformers to graph neural networks, aim to standardize evaluation methodology, address issues such as fairness and robustness, and quantify uncertainty in model performance. The resulting standardized evaluations and datasets enable more rigorous model comparisons and expose areas needing improvement, ultimately supporting more reliable and effective AI systems across a wide range of applications.
Papers
A New Benchmark and Model for Challenging Image Manipulation Detection
Zhenfei Zhang, Mingyang Li, Ming-Ching Chang
3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology
Asma Ben Abacha, Alberto Santamaria-Pang, Ho Hin Lee, Jameson Merkow, Qin Cai, Surya Teja Devarakonda, Abdullah Islam, Julia Gong, Matthew P. Lungren, Thomas Lin, Noel C Codella, Ivan Tarapov
AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL
Katharina Stein, Daniel Fišer, Jörg Hoffmann, Alexander Koller
Investigating Data Contamination in Modern Benchmarks for Large Language Models
Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan
MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification
Chadi Helwe, Tom Calamai, Pierre-Henri Paris, Chloé Clavel, Fabian Suchanek
Imagine the Unseen World: A Benchmark for Systematic Generalization in Visual World Models
Yeongbin Kim, Gautam Singh, Junyeong Park, Caglar Gulcehre, Sungjin Ahn
Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks
Ting-Yun Chang, Jesse Thomason, Robin Jia
ConeQuest: A Benchmark for Cone Segmentation on Mars
Mirali Purohit, Jacob Adler, Hannah Kerner
PSST: A Benchmark for Evaluation-driven Text Public-Speaking Style Transfer
Huashan Sun, Yixiao Wu, Yuhao Ye, Yizhe Yang, Yinghao Li, Jiawei Li, Yang Gao
Extrinsically-Focused Evaluation of Omissions in Medical Summarization
Elliot Schumacher, Daniel Rosenthal, Varun Nair, Luladay Price, Geoffrey Tso, Anitha Kannan
All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction
Yuhan Li, Jian Wu, Zhiwei Yu, Börje F. Karlsson, Wei Shen, Manabu Okumura, Chin-Yew Lin
RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge
Yi Liu, Lianzhe Huang, Shicheng Li, Sishuo Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun