New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodology, probe fairness and robustness, and quantify uncertainty in model performance, and they span architectures ranging from transformers to graph neural networks. The resulting standardized evaluations and datasets enable more rigorous comparisons between models and expose areas that need improvement, ultimately supporting more reliable and effective AI systems across a wide range of applications.
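As a toy illustration of the kind of evaluation these benchmarks standardize, the Python sketch below scores a model on question-answer pairs and attaches a percentile-bootstrap confidence interval to its accuracy. The predict() stub, the toy dataset, and all parameter choices are hypothetical placeholders for illustration only; they are not taken from any of the listed papers.

# Minimal sketch: score a model on a benchmark split and quantify
# uncertainty in its accuracy with a percentile bootstrap.
# predict() and the toy dataset are hypothetical placeholders for a real
# model call and a real benchmark loader.
import random

def predict(question: str) -> str:
    # Hypothetical stand-in for a real model call (e.g. an LLM API).
    return "A"

def accuracy(scores):
    return sum(scores) / len(scores)

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    # Percentile bootstrap over per-example correctness (0/1) scores.
    rng = random.Random(seed)
    stats = sorted(
        accuracy([rng.choice(scores) for _ in scores])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

if __name__ == "__main__":
    # Toy benchmark: (question, gold answer) pairs; a real benchmark would
    # load these from a dataset file.
    benchmark = [("Q%d" % i, random.Random(i).choice("AB")) for i in range(200)]
    scores = [1 if predict(q) == gold else 0 for q, gold in benchmark]
    point = accuracy(scores)
    lo, hi = bootstrap_ci(scores)
    print(f"accuracy = {point:.3f}  (95% bootstrap CI: {lo:.3f}-{hi:.3f})")

Reporting an interval rather than a single accuracy number is one simple way the "quantify uncertainty" goal mentioned above can be put into practice when comparing models on a benchmark.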
Papers
Sketch-based Video Object Segmentation: Benchmark and Analysis
Ruolin Yang, Da Li, Conghui Hu, Timothy Hospedales, Honggang Zhang, Yi-Zhe Song
Exposition on over-squashing problem on GNNs: Current Methods, Benchmarks and Challenges
Dai Shi, Andi Han, Lequan Lin, Yi Guo, Junbin Gao
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt, Aykut Erdem, Erkut Erdem
Refining the ONCE Benchmark with Hyperparameter Tuning
Maksim Golyadkin, Alexander Gambashidze, Ildar Nurgaliev, Ilya Makarov
Trends in Integration of Knowledge and Large Language Models: A Survey and Taxonomy of Methods, Benchmarks, and Applications
Zhangyin Feng, Weitao Ma, Weijiang Yu, Lei Huang, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu
JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds
Saeed Saadatnejad, Yang Gao, Hamid Rezatofighi, Alexandre Alahi
ISAR: A Benchmark for Single- and Few-Shot Object Instance Segmentation and Re-Identification
Nicolas Gorlo, Kenneth Blomqvist, Francesco Milano, Roland Siegwart
Benchmarking a Benchmark: How Reliable is MS-COCO?
Eric Zimmermann, Justin Szeto, Jerome Pasquero, Frederic Ratle
Don't Make Your LLM an Evaluation Benchmark Cheater
Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation
Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou
SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark
Zhengdi Yu, Shaoli Huang, Yongkang Cheng, Tolga Birdal
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, Wei Wang