New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including low-resource translation, knowledge locating in language models, out-of-distribution detection in medical tabular data, LLM routing, EEG classification, text-to-shape coherence, and eyeblink detection. These benchmarks aim to standardize evaluation methodology, probe robustness under distribution shift, and quantify uncertainty in model performance, covering architectures from transformers to graph neural networks. The resulting standardized evaluations and datasets enable more rigorous model comparisons, expose areas that need improvement, and ultimately support more reliable and effective AI systems across a range of applications.
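As a concrete illustration of the kind of standardized evaluation with uncertainty quantification that these benchmarks aim for, the minimal sketch below scores a model on a benchmark and reports a percentile-bootstrap confidence interval around its mean accuracy. All names here (evaluate, model_fn, benchmark, the commented usage) are hypothetical assumptions for illustration, not an API from any of the papers listed.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean benchmark score."""
    rng = random.Random(seed)
    resampled_means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = resampled_means[int(alpha / 2 * n_resamples)]
    hi = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), (lo, hi)

def evaluate(model_fn, benchmark):
    """Score one model on one benchmark: 1.0 per exact-match answer, 0.0 otherwise."""
    scores = [1.0 if model_fn(ex["input"]) == ex["target"] else 0.0 for ex in benchmark]
    return bootstrap_ci(scores)

# Hypothetical usage: compare several models on several task suites.
# for model_name, model_fn in models.items():
#     for bench_name, data in benchmarks.items():
#         acc, (lo, hi) = evaluate(model_fn, data)
#         print(f"{model_name} on {bench_name}: {acc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Reporting an interval rather than a single accuracy number is one simple way such benchmarks can make cross-model comparisons more rigorous, since overlapping intervals signal that an apparent ranking may not be robust.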
Papers
A Benchmark for Learning to Translate a New Language from One Grammar Book
Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, Luke Melas-Kyriazi
KLoB: a Benchmark for Assessing Knowledge Locating Methods in Language Models
Yiming Ju, Xingrun Xing, Zhixiong Zeng
Unmasking the Chameleons: A Benchmark for Out-of-Distribution Detection in Medical Tabular Data
Mohammad Azizmalayeri, Ameen Abu-Hanna, Giovanni Cinà
Large Language Model Routing with Benchmark Datasets
Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, Mikhail Yurochkin
GNN4EEG: A Benchmark and Toolkit for Electroencephalography Classification with Graph Neural Network
Kaiyuan Zhang, Ziyi Ye, Qingyao Ai, Xiaohui Xie, Yiqun Liu
Looking at words and points with attention: a benchmark for text-to-shape coherence
Andrea Amaduzzi, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
mEBAL2 Database and Benchmark: Image-based Multispectral Eyeblink Detection
Roberto Daza, Aythami Morales, Julian Fierrez, Ruben Tolosana, Ruben Vera-Rodriguez