New Benchmarks
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, from empathy assessment and dialectal or culturally aligned language understanding to semantic segmentation, machine-learning interatomic potentials, grapheme-to-phoneme conversion, and image and video coding. These benchmarks aim to standardize evaluation methodology, probe fairness and robustness, and quantify uncertainty in model performance. The resulting standardized evaluations and datasets enable more rigorous model comparisons, expose areas needing improvement, and ultimately support more reliable and effective AI systems across numerous applications.
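To make the uncertainty-quantification point concrete, here is a minimal, illustrative sketch of a percentile-bootstrap confidence interval over per-example benchmark scores. It is not taken from any of the listed papers; the function name, parameters, and the 200-item example are hypothetical.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean benchmark score.

    `scores` holds per-example results (e.g. 0/1 correctness); all names
    here are illustrative rather than drawn from any listed paper.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the evaluation set with replacement and record each mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    k = int(n_resamples * alpha / 2)
    return sum(scores) / n, (means[k], means[-k - 1])

# Hypothetical 200-item benchmark split with 143 correct answers.
scores = [1] * 143 + [0] * 57
mean, (lo, hi) = bootstrap_ci(scores)
print(f"accuracy = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Reporting an interval rather than a single point estimate is one simple way a benchmark can quantify uncertainty, especially on small evaluation splits where a few examples can swing the leaderboard.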
Papers
Hydrogen under Pressure as a Benchmark for Machine-Learning Interatomic Potentials
Thomas Bischoff, Bastian Jäckl, Matthias Rupp
EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models
Yuyan Chen, Hao Wang, Songzhou Yan, Sijia Liu, Yueze Li, Yi Zhao, Yanghua Xiao
Revisiting Synthetic Human Trajectories: Imitative Generation and Benchmarks Beyond Datasaurus
Bangchao Deng, Xin Jing, Tianyue Yang, Bingqing Qu, Philippe Cudre-Mauroux, Dingqi Yang
JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models
Junfeng Jiang, Jiahao Huang, Akiko Aizawa
System 2 thinking in OpenAI's o1-preview model: Near-perfect performance on a mathematics exam
Joost de Winter, Dimitra Dodou, Yke Bauke Eisma
COCO-Occ: A Benchmark for Occluded Panoptic Segmentation and Image Understanding
Wenbo Wei, Jun Wang, Abhir Bhalerao
CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks
Zhaozhi Qian, Faroq Altam, Muhammad Saleh Saeed Alqurishi, Riad Souissi
Reference Dataset and Benchmark for Reconstructing Laser Parameters from On-axis Video in Powder Bed Fusion of Bulk Stainless Steel
Cyril Blanc, Ayyoub Ahar, Kurt De Grave
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
Basel Mousi, Nadir Durrani, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain, Tameem Kabbani, Fahim Dalvi, Shammur Absar Chowdhury, Firoj Alam
Generalized Few-Shot Semantic Segmentation in Remote Sensing: Challenge and Benchmark
Clifford Broni-Bediako, Junshi Xia, Jian Song, Hongruixuan Chen, Mennatullah Siam, Naoto Yokoya
HS3-Bench: A Benchmark and Strong Baseline for Hyperspectral Semantic Segmentation in Driving Scenarios
Nick Theisen, Robin Bartsch, Dietrich Paulus, Peer Neubert
Eureka: Evaluating and Understanding Large Foundation Models
Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
USTC-TD: A Test Dataset and Benchmark for Image and Video Coding in 2020s
Zhuoyuan Li, Junqi Liao, Chuanbo Tang, Haotian Zhang, Yuqi Li, Yifan Bian, Xihua Sheng, Xinmin Feng, Yao Li, Changsheng Gao, Li Li, Dong Liu, Feng Wu