New Benchmarks
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodology, address fairness and robustness, and quantify uncertainty in model performance, covering model families such as transformers and graph neural networks. The resulting standardized evaluations and datasets enable more rigorous comparisons between models, expose areas needing improvement, and ultimately support more reliable and effective AI systems across a wide range of applications.
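To make the idea of "quantifying uncertainty in model performance" concrete, here is a minimal, illustrative sketch of how a benchmark harness might report a point estimate together with a bootstrap confidence interval over per-item scores. It is not taken from any of the papers below; the function name and the example scores are hypothetical.

```python
import random
from statistics import mean

def bootstrap_accuracy_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for benchmark accuracy.

    `correct` is a list of 0/1 scores, one per benchmark item.
    (Illustrative helper; not from any specific paper listed here.)
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample items with replacement and record the accuracy of each resample.
    resampled = sorted(
        mean(rng.choices(correct, k=n)) for _ in range(n_resamples)
    )
    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[int((1 - alpha / 2) * n_resamples)]
    return mean(correct), (lo, hi)

# Hypothetical per-item scores from a single benchmark run.
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
acc, (lo, hi) = bootstrap_accuracy_ci(scores)
print(f"accuracy = {acc:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Reporting an interval rather than a bare accuracy number is one simple way such benchmarks make comparisons between models more rigorous, since overlapping intervals signal that an apparent ranking may not be statistically meaningful.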
Papers
Efficient multi-prompt evaluation of LLMs
Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin
Towards Ultra-High-Definition Image Deraining: A Benchmark and An Efficient Method
Hongming Chen, Xiang Chen, Chen Wu, Zhuoran Zheng, Jinshan Pan, Xianping Fu
CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models
Tong Zhang, Peixin Qin, Yang Deng, Chen Huang, Wenqiang Lei, Junhong Liu, Dingnan Jin, Hongru Liang, Tat-Seng Chua
Generalization Ability of Feature-based Performance Prediction Models: A Statistical Analysis across Benchmarks
Ana Nikolikj, Ana Kostovska, Gjorgjina Cenikj, Carola Doerr, Tome Eftimov
Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion Models
Xiyu Wang, Yufei Wang, Satoshi Tsutsui, Weisi Lin, Bihan Wen, Alex C. Kot
STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan
Benchmark Early and Red Team Often: A Framework for Assessing and Managing Dual-Use Hazards of AI Foundation Models
Anthony M. Barrett, Krystal Jackson, Evan R. Murphy, Nada Madkour, Jessica Newman