New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including vision-language navigation, rule-guided reasoning, robotic manipulation, multimodal planning, and code generation. These benchmarks aim to standardize evaluation methodologies, address issues such as fairness and robustness, and quantify uncertainty in model performance. The resulting standardized evaluations and datasets are crucial for advancing the field: they enable more rigorous comparisons between models, expose areas needing improvement, and ultimately support more reliable and effective AI systems across numerous applications.
Papers
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang
Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark
Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yi Lu, Bozheng Li, Weiheng Chi, Zihan Qiu, Lirian Su, Haolin Zheng, Jay Wu, Xu Yang
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge
AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
Lan Li, Liri Fang, Vetle I. Torvik
ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks
Arth Shukla, Stone Tao, Hao Su
TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft
Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao
A Survey of Large Language Model-Based Generative AI for Text-to-SQL: Benchmarks, Applications, Use Cases, and Challenges
Aditi Singh, Akash Shetty, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei
BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English
Dipankar Srirag, Aditya Joshi, Jordan Painter, Diptesh Kanojia
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu, Yuying Ge, Yi Chen, Yixiao Ge, Ying Shan, Xihui Liu
Bench-CoE: a Framework for Collaboration of Experts from Benchmark
Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu
Does your model understand genes? A benchmark of gene properties for biological and text models
Yoav Kan-Tor, Michael Morris Danziger, Eden Zohar, Matan Ninio, Yishai Shimoni
Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code
Timur Galimzyanov, Sergey Titov, Yaroslav Golubev, Egor Bogomolov
Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark
Haidong Xu, Meishan Zhang, Hao Ju, Zhedong Zheng, Hongyuan Zhu, Erik Cambria, Min Zhang, Hao Fei