New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks, which cover architectures such as transformers and graph neural networks, aim to standardize evaluation methodologies, address issues such as fairness and robustness, and quantify uncertainty in model performance. The resulting standardized evaluations and datasets enable more rigorous model comparisons and highlight areas needing improvement, ultimately supporting more reliable and effective AI systems across a wide range of applications.
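As one illustration of the uncertainty quantification mentioned above, the sketch below computes a percentile-bootstrap confidence interval for a model's accuracy on a benchmark split. It is a minimal example, not taken from any of the papers listed; the per_example_correct data and the bootstrap_ci helper are hypothetical.

    import random
    from statistics import mean

    def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
        """Percentile bootstrap confidence interval for the mean of per-example scores."""
        rng = random.Random(seed)
        n = len(scores)
        # Resample the per-example scores with replacement and collect the resampled means.
        means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
        lo = means[int((alpha / 2) * n_resamples)]
        hi = means[int((1 - alpha / 2) * n_resamples) - 1]
        return mean(scores), (lo, hi)

    # Hypothetical per-example correctness (1 = correct, 0 = incorrect) on a small benchmark split.
    per_example_correct = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
    acc, (low, high) = bootstrap_ci(per_example_correct)
    print(f"accuracy = {acc:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")

Reporting an interval rather than a single score makes comparisons between models on the same benchmark less sensitive to which examples happen to be in the evaluation set.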
Papers
Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, Yejin Choi
A Benchmark for Multi-Modal Lidar SLAM with Ground Truth in GNSS-Denied Environments
Ha Sier, Li Qingqing, Yu Xianjia, Jorge Peña Queralta, Zhuo Zou, Tomi Westerlund
A Benchmark for Weakly Semi-Supervised Abnormality Localization in Chest X-Rays
Haoqin Ji, Haozhe Liu, Yuexiang Li, Jinheng Xie, Nanjun He, Yawen Huang, Dong Wei, Xinrong Chen, Linlin Shen, Yefeng Zheng
A Benchmark for Unsupervised Anomaly Detection in Multi-Agent Trajectories
Julian Wiederer, Julian Schmidt, Ulrich Kressel, Klaus Dietmayer, Vasileios Belagiannis
MDIA: A Benchmark for Multilingual Dialogue Generation in 46 Languages
Qingyu Zhang, Xiaoyu Shen, Ernie Chang, Jidong Ge, Pengke Chen
CrackSeg9k: A Collection and Benchmark for Crack Segmentation Datasets and Frameworks
Shreyas Kulkarni, Shreyas Singh, Dhananjay Balakrishnan, Siddharth Sharma, Saipraneeth Devunuri, Sai Chowdeswara Rao Korlapati