New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including visual storytelling, relational databases, explainable AI, plant stress phenotyping, interactive coding agents, embodied instruction following, and multi-agent reinforcement learning. These benchmarks aim to standardize evaluation methodologies, probe issues such as robustness and generalization, and quantify uncertainty in model performance across architectures ranging from transformers to graph neural networks. The resulting standardized evaluations and datasets enable more rigorous model comparisons, expose areas that need improvement, and ultimately support more reliable and effective AI systems across a wide range of applications.
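The papers below each define their own tasks and metrics, but the shared theme of quantifying uncertainty in benchmark scores can be illustrated with a minimal, generic sketch: a percentile bootstrap over per-task scores that reports a mean together with a confidence interval. The scores and function name here are hypothetical and are not drawn from any of the listed benchmarks.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean benchmark score."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample the per-task scores with replacement and record the mean.
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.mean(scores), (lo, hi)

# Hypothetical per-task accuracies for one model on one benchmark.
task_scores = [0.82, 0.74, 0.91, 0.66, 0.79, 0.88, 0.71, 0.85]
mean, (low, high) = bootstrap_ci(task_scores)
print(f"mean accuracy = {mean:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Reporting an interval rather than a single number makes comparisons between models more robust to the particular tasks sampled into a benchmark; several of the benchmarks below adopt similar uncertainty-aware reporting in their own evaluation protocols.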
Papers
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
Zilyu Ye, Jinxiu Liu, Ruotian Peng, Jinjin Cao, Zhiyang Chen, Yiyang Zhang, Ziwei Xuan, Mingyuan Zhou, Xiaoqian Shen, Mohamed Elhoseiny, Qi Liu, Guo-Jun Qi
PAGED: A Benchmark for Procedural Graphs Extraction from Documents
Weihong Du, Wenrui Liao, Hongru Liang, Wenqiang Lei
RelBench: A Benchmark for Deep Learning on Relational Databases
Joshua Robinson, Rishabh Ranjan, Weihua Hu, Kexin Huang, Jiaqi Han, Alejandro Dobles, Matthias Fey, Jan E. Lenssen, Yiwen Yuan, Zecheng Zhang, Xinwei He, Jure Leskovec
BEExAI: Benchmark to Evaluate Explainable AI
Samuel Sithakoul, Sara Meftah, Clément Feutry
End-to-end SYNTAX score prediction: benchmark and methods
Alexander Ponomarchuk, Ivan Kruzhilov, Galina Zubkova, Artem Shadrin, Ruslan Utegenov, Ivan Bessonov, Pavel Blinov
AgEval: A Benchmark for Zero-Shot and Few-Shot Plant Stress Phenotyping with Multimodal LLMs
Muhammad Arbab Arshad, Talukder Zaki Jubery, Tirtho Roy, Rim Nassiri, Asheesh K. Singh, Arti Singh, Chinmay Hegde, Baskar Ganapathysubramanian, Aditya Balu, Adarsh Krishnamurthy, Soumik Sarkar
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian
The Cross-environment Hyperparameter Setting Benchmark for Reinforcement Learning
Andrew Patterson, Samuel Neumann, Raksha Kumaraswamy, Martha White, Adam White
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
Taewoong Kim, Cheolhong Min, Byeonghwi Kim, Jinyeon Kim, Wonje Jeung, Jonghyun Choi
$\textit{BenchIE}^{FL}$: A Manually Re-Annotated Fact-Based Open Information Extraction Benchmark
Fabrice Lamarche, Philippe Langlais
PyBench: Evaluating LLM Agent on various real-world coding tasks
Yaolun Zhang, Yinxu Pan, Yudong Wang, Jie Cai
Can time series forecasting be automated? A benchmark and analysis
Anvitha Thirthapura Sreedhara, Joaquin Vanschoren
MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning
Florian Felten, Umut Ucak, Hicham Azmani, Gao Peng, Willem Röpke, Hendrik Baier, Patrick Mannion, Diederik M. Roijers, Jordan K. Terry, El-Ghazali Talbi, Grégoire Danoy, Ann Nowé, Roxana Rădulescu