Benchmark Datasets
Benchmark datasets are crucial for evaluating machine learning models across diverse tasks, from natural language processing to image analysis and graph learning. Current research emphasizes building more robust and representative benchmarks that address data leakage, bias, and train-test distribution mismatches, all of which can inflate reported scores and prevent fair comparisons between models. Improving dataset quality is therefore essential for reliable model evaluation and for developing more accurate, generalizable algorithms, which in turn shapes how trustworthy and practically useful AI systems are.
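To make the data-leakage concern concrete, the short Python sketch below is a minimal, illustrative check (not taken from any of the papers listed here): it flags test examples that appear verbatim in the training split. Real benchmark audits would also need to handle near-duplicates, paraphrases, and temporal or entity-level leakage, which this simple exact-match check does not cover.

# Minimal sketch, illustrative only: detect exact-duplicate leakage
# between a training split and a test split of text examples.
# The example sentences below are invented placeholders.

def find_exact_leakage(train_examples, test_examples):
    """Return the test examples that also appear verbatim in the training split."""
    train_set = set(train_examples)
    return [example for example in test_examples if example in train_set]

if __name__ == "__main__":
    train = ["the cat sat on the mat", "graphs encode relations"]
    test = ["graphs encode relations", "a previously unseen sentence"]
    leaked = find_exact_leakage(train, test)
    print(f"{len(leaked)} of {len(test)} test examples overlap with training: {leaked}")

Even this coarse check, run before reporting results, can reveal inflated scores caused by overlapping splits, which is one of the evaluation pitfalls the papers below aim to address more systematically.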
Papers
Towards Better Benchmark Datasets for Inductive Knowledge Graph Completion
Harry Shomer, Jay Revolinsky, Jiliang Tang
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework
Olivier Binette, Jerome P. Reiter
TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs
Julia Gastinger, Shenyang Huang, Mikhail Galkin, Erfan Loghmani, Ali Parviz, Farimah Poursafaei, Jacob Danovitch, Emanuele Rossi, Ioannis Koutis, Heiner Stuckenschmidt, Reihaneh Rabbany, Guillaume Rabusseau
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
Liam Dugan, Alyssa Hwang, Filip Trhlik, Josh Magnus Ludan, Andrew Zhu, Hainiu Xu, Daphne Ippolito, Chris Callison-Burch
Squeezing Lemons with Hammers: An Evaluation of AutoML and Tabular Deep Learning for Data-Scarce Classification Applications
Ricardo Knauer, Erik Rodner