New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues like fairness and robustness, and quantify uncertainty in model performance, using various architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets are crucial for advancing the field by facilitating more rigorous comparisons of models and identifying areas needing improvement, ultimately leading to more reliable and effective AI systems across numerous applications.
Papers
Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages
Edward Bayes, Israel Abebe Azime, Jesujoba O. Alabi, Jonas Kgomo, Tyna Eloundou, Elizabeth Proehl, Kai Chen, Imaan Khadir, Naome A. Etori, Shamsuddeen Hassan Muhammad, Choice Mpanza, Igneciah Pocia Thete, Dietrich Klakow, David Ifeoluwa Adelani
SEED4D: A Synthetic Ego--Exo Dynamic 4D Data Generator, Driving Dataset and Benchmark
Marius Kästingschäfer, Théo Gieruc, Sebastian Bernhard, Dylan Campbell, Eldar Insafutdinov, Eyvaz Najafli, Thomas Brox
ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain
Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nick Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee
Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach
Feiyang Liu, Dan Guo, Jingyuan Xu, Zihao He, Shengeng Tang, Kun Li, Meng Wang
Are Anomaly Scores Telling the Whole Story? A Benchmark for Multilevel Anomaly Detection
Tri Cao, Minh-Huy Trinh, Ailin Deng, Quoc-Nam Nguyen, Khoa Duong, Ngai-Man Cheung, Bryan Hooi
Self-supervised learning for radio-astronomy source classification: a benchmark
Thomas Cecconello, Simone Riggi, Ugo Becciano, Fabio Vitello, Andrew M. Hopkins, Giuseppe Vizzari, Concetto Spampinato, Simone Palazzo
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs
Shirley Kokane, Ming Zhu, Tulika Awalgaonkar, Jianguo Zhang, Thai Hoang, Akshara Prabhakar, Zuxin Liu, Tian Lan, Liangwei Yang, Juntao Tan, Rithesh Murthy, Weiran Yao, Zhiwei Liu, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong, Silivo Savarese
AIDBench: A benchmark for evaluating the authorship identification capability of large language models
Zichen Wen, Dadi Guo, Huishuai Zhang
Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark
Bing Cao, Quanhao Lu, Jiekang Feng, Pengfei Zhu, Qinghua Hu, Qilong Wang
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, Mykel J. Kochenderfer
Introducing Milabench: Benchmarking Accelerators for AI
Pierre Delaunay, Xavier Bouthillier, Olivier Breuleux, Satya Ortiz-Gagné, Olexa Bilaniuk, Fabrice Normandin, Arnaud Bergeron, Bruno Carrez, Guillaume Alain, Soline Blanc, Frédéric Osterrath, Joseph Viviano, Roger Creus-Castanyer Darshan Patil, Rabiul Awal, Le Zhang
MSSIDD: A Benchmark for Multi-Sensor Denoising
Shibin Mei, Hang Wang, Bingbing Ni
Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
Yutao Hou, Yajing Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen
InvestESG: A multi-agent reinforcement learning benchmark for studying climate investment as a social dilemma
Xiaoxuan Hou, Jiayi Yuan, Joel Z. Leibo, Natasha Jaques