New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including panoptic and semantic segmentation, dialectal and culturally aligned language understanding, grounded and task-oriented question answering, graph learning, and image and video coding. These benchmarks aim to standardize evaluation methodologies, probe robustness and cultural fairness, and quantify uncertainty in the performance of architectures ranging from transformers to graph neural networks. The resulting standardized evaluations and datasets enable more rigorous model comparisons, expose areas needing improvement, and ultimately support more reliable and effective AI systems across numerous applications.
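To make the evaluation pattern these papers share more concrete, here is a minimal sketch of a standardized benchmark harness that scores a model on a fixed task split and quantifies uncertainty with a percentile bootstrap confidence interval. It is illustrative only: the function names, the toy exact-match task, and the placeholder model are assumptions and do not correspond to any listed benchmark's actual API.

```python
# Illustrative sketch of standardized benchmark evaluation with uncertainty
# quantification (bootstrap CI). All names here are hypothetical examples.
import random
from statistics import mean
from typing import Callable, Sequence, Tuple


def bootstrap_ci(per_item_scores: Sequence[float],
                 n_resamples: int = 1000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval over per-item scores."""
    rng = random.Random(seed)
    n = len(per_item_scores)
    resampled_means = sorted(
        mean(rng.choice(per_item_scores) for _ in range(n))
        for _ in range(n_resamples)
    )
    lo = resampled_means[int((alpha / 2) * n_resamples)]
    hi = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def evaluate(model: Callable[[str], str],
             inputs: Sequence[str],
             references: Sequence[str]) -> None:
    """Score a model on a fixed benchmark split and report mean accuracy with a 95% CI."""
    predictions = [model(x) for x in inputs]
    per_item = [float(p == r) for p, r in zip(predictions, references)]
    lo, hi = bootstrap_ci(per_item)
    print(f"accuracy={mean(per_item):.3f}  95% CI=({lo:.3f}, {hi:.3f})")


if __name__ == "__main__":
    # Toy exact-match task; a lookup table stands in for a real model.
    inputs = ["2+2", "3+5", "10-4", "6*7"]
    references = ["4", "8", "6", "42"]
    toy_model = {"2+2": "4", "3+5": "8", "10-4": "7", "6*7": "42"}.get
    evaluate(toy_model, inputs, references)
```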
Papers
COCO-Occ: A Benchmark for Occluded Panoptic Segmentation and Image Understanding
Wenbo Wei, Jun Wang, Abhir Bhalerao
CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks
Zhaozhi Qian, Faroq Altam, Muhammad Saleh Saeed Alqurishi, Riad Souissi
Reference Dataset and Benchmark for Reconstructing Laser Parameters from On-axis Video in Powder Bed Fusion of Bulk Stainless Steel
Cyril Blanc, Ayyoub Ahar, Kurt De Grave
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
Basel Mousi, Nadir Durrani, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain, Tameem Kabbani, Fahim Dalvi, Shammur Absar Chowdhury, Firoj Alam
Generalized Few-Shot Semantic Segmentation in Remote Sensing: Challenge and Benchmark
Clifford Broni-Bediako, Junshi Xia, Jian Song, Hongruixuan Chen, Mennatullah Siam, Naoto Yokoya
HS3-Bench: A Benchmark and Strong Baseline for Hyperspectral Semantic Segmentation in Driving Scenarios
Nick Theisen, Robin Bartsch, Dietrich Paulus, Peer Neubert
Eureka: Evaluating and Understanding Large Foundation Models
Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee
USTC-TD: A Test Dataset and Benchmark for Image and Video Coding in 2020s
Zhuoyuan Li, Junqi Liao, Chuanbo Tang, Haotian Zhang, Yuqi Li, Yifan Bian, Xihua Sheng, Xinmin Feng, Yao Li, Changsheng Gao, Li Li, Dong Liu, Feng Wu
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Sacha Muller, António Loison, Bilel Omrani, Gautier Viaud
Texture-AD: An Anomaly Detection Dataset and Benchmark for Real Algorithm Development
Tianwu Lei, Bohan Wang, Silin Chen, Shurong Cao, Ningmu Zou
ClarQ-LLM: A Benchmark for Models Clarifying and Requesting Information in Task-Oriented Dialog
Yujian Gan, Changling Li, Jinxia Xie, Luou Wen, Matthew Purver, Massimo Poesio
Are Heterophily-Specific GNNs and Homophily Metrics Really Effective? Evaluation Pitfalls and New Benchmarks
Sitao Luan, Qincheng Lu, Chenqing Hua, Xinyu Wang, Jiaqi Zhu, Xiao-Wen Chang, Guy Wolf, Jian Tang
A System and Benchmark for LLM-based Q&A on Heterogeneous Data
Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath III, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman