New Benchmark
Recent research focuses on developing new benchmarks and datasets for evaluating machine learning models across diverse tasks, including few-shot classification of histological images, multilingual product matching, question answering over evolving knowledge, low-resource natural language generation, text classification, and skeleton-based action recognition. These benchmarks aim to standardize evaluation methodologies, extend coverage to under-served languages and domains, and probe how well models adapt to new data and knowledge over time. The resulting standardized evaluations and datasets are crucial for advancing the field: they enable more rigorous comparisons between models, expose areas that need improvement, and ultimately lead to more reliable and effective systems across numerous applications.
Papers
FHIST: A Benchmark for Few-shot Classification of Histological Images
Fereshteh Shakeri, Malik Boudiaf, Sina Mohammadi, Ivaxi Sheth, Mohammad Havaei, Ismail Ben Ayed, Samira Ebrahimi Kahou
Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish
Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk, Szymon Łukasik
StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models
Adam Liška, Tomáš Kočiský, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, Cyprien de Masson d'Autume, Tim Scholtes, Manzil Zaheer, Susannah Young, Ellen Gilsenan-McMahon, Sophia Austin, Phil Blunsom, Angeliki Lazaridou
BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla
Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Rifat Shahriyar
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification
Yang Xiao, Jinlan Fu, See-Kiong Ng, Pengfei Liu
ANUBIS: Skeleton Action Recognition Dataset, Review, and Benchmark
Zhenyue Qin, Yang Liu, Madhawa Perera, Tom Gedeon, Pan Ji, Dongwoo Kim, Saeed Anwar