New Benchmark
Recent research has focused on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including financial question answering, speech recognition, 3D semantic segmentation, and object manipulation. These benchmarks aim to standardize evaluation methodology, probe robustness under distribution shift, and quantify uncertainty in model performance, covering architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets enable more rigorous model comparisons and expose areas needing improvement, ultimately supporting more reliable and effective AI systems across a wide range of applications.
Papers
A Benchmark for Out of Distribution Detection in Point Cloud 3D Semantic Segmentation
Lokesh Veeramacheneni, Matias Valdenegro-Toro
SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection
Jiangyan Yi, Chenglong Wang, Jianhua Tao, Chu Yuan Zhang, Cunhang Fan, Zhengkun Tian, Haoxin Ma, Ruibo Fu
Data-Driven Network Neuroscience: On Data Collection and Benchmark
Jiaxing Xu, Yunhan Yang, David Tse Jung Huang, Sophi Shilpa Gururajapathy, Yiping Ke, Miao Qiao, Alan Wang, Haribalan Kumar, Josh McGeown, Eryn Kwon
DiaASQ: A Benchmark of Conversational Aspect-based Sentiment Quadruple Analysis
Bobo Li, Hao Fei, Fei Li, Yuhan Wu, Jinsong Zhang, Shengqiong Wu, Jingye Li, Yijiang Liu, Lizi Liao, Tat-Seng Chua, Donghong Ji
Benchmark for Models Predicting Human Behavior in Gap Acceptance Scenarios
Julian Frederik Schumann, Jens Kober, Arkady Zgonnikov
A new benchmark for group distribution shifts in hand grasp regression for object manipulation. Can meta-learning raise the bar?
Théo Morales, Gerard Lacey
When FLUE Meets FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain
Raj Sanjay Shah, Kunal Chawla, Dheeraj Eidnani, Agam Shah, Wendi Du, Sudheer Chava, Natraj Raman, Charese Smiley, Jiaao Chen, Diyi Yang
RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering
Victor Zhong, Weijia Shi, Wen-tau Yih, Luke Zettlemoyer
CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization
Hossein Rajaby Faghihi, Bashar Alhafni, Ke Zhang, Shihao Ran, Joel Tetreault, Alejandro Jaimes
Instance Segmentation for Chinese Character Stroke Extraction, Datasets and Benchmarks
Lizhao Liu, Kunyang Lin, Shangxin Huang, Zhongli Li, Chao Li, Yunbo Cao, Qingyu Zhou
Avalon: A Benchmark for RL Generalization Using Procedurally Generated Worlds
Joshua Albrecht, Abraham J. Fetterman, Bryden Fogelman, Ellie Kitanidis, Bartosz Wróblewski, Nicole Seo, Michael Rosenthal, Maksis Knutins, Zachary Polizzi, James B. Simon, Kanjun Qiu
ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition
Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
BARS: A Benchmark for Airport Runway Segmentation
Wenhui Chen, Zhijiang Zhang, Liang Yu, Yichun Tai