New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues like fairness and robustness, and quantify uncertainty in model performance, using various architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets are crucial for advancing the field by facilitating more rigorous comparisons of models and identifying areas needing improvement, ultimately leading to more reliable and effective AI systems across numerous applications.
Papers
TiEBe: A Benchmark for Assessing the Current Knowledge of Large Language Models
Thales Sales Almeida, Giovana Kerche Bonás, João Guilherme Alves Santos, Hugo Abonizio, Rodrigo Nogueira
Toward Realistic Camouflaged Object Detection: Benchmarks and Method
Zhimeng Xin, Tianxu Wu, Shiming Chen, Shuo Ye, Zijing Xie, Yixiong Zou, Xinge You, Yufei Guo
Bridging Smart Meter Gaps: A Benchmark of Statistical, Machine Learning and Time Series Foundation Models for Data Imputation
Amir Sartipi, Joaquin Delgado Fernandez, Sergio Potenciano Menci, Alessio Magitteri
Exploring Pose-Based Anomaly Detection for Retail Security: A Real-World Shoplifting Dataset and Benchmark
Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Shanle Yao, Hamed Tabkhi
Cooperative Aerial Robot Inspection Challenge: A Benchmark for Heterogeneous Multi-UAV Planning and Lessons Learned
Muqing Cao, Thien-Minh Nguyen, Shenghai Yuan, Andreas Anastasiou, Angelos Zacharia, Savvas Papaioannou, Panayiotis Kolios, Christos G. Panayiotou, Marios M. Polycarpou, Xinhang Xu, Mingjie Zhang, Fei Gao, Boyu Zhou, Ben M. Chen, Lihua Xie
Vision Transformer Neural Architecture Search for Out-of-Distribution Generalization: Benchmark and Insights
Sy-Tuyen Ho, Tuan Van Vo, Somayeh Ebrahimkhani, Ngai-Man Cheung
AutoFish: Dataset and Benchmark for Fine-grained Analysis of Fish
Stefan Hein Bengtson, Daniel Lehotský, Vasiliki Ismiroglou, Niels Madsen, Thomas B. Moeslund, Malte Pedersen
Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks
Yang Wang, Chenghua Lin
Towards New Benchmark for AI Alignment & Sentiment Analysis in Socially Important Issues: A Comparative Study of Human and LLMs in the Context of AGI
Ljubisa Bojic, Dylan Seychell, Milan Cabarkapa
Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method
Hui Li, Xiaoyu Ren, Hongjiu Yu, Huiyu Duan, Kai Li, Ying Chen, Libo Wang, Xiongkuo Min, Guangtao Zhai, Xu Liu
Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark
Liang He, Yougang Chu, Zhen Wu, Jianbing Zhang, Xinyu Dai, Jiajun Chen
HarmonyIQA: Pioneering Benchmark and Model for Image Harmonization Quality Assessment
Zitong Xu, Huiyu Duan, Guangji Ma, Liu Yang, Jiarui Wang, Qingbo Wu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet