New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models, including transformer- and graph-neural-network-based architectures, across diverse tasks such as economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues such as fairness and robustness, and quantify uncertainty in model performance. The resulting standardized evaluations and datasets enable more rigorous model comparisons, expose areas that need improvement, and ultimately support more reliable and effective AI systems across a wide range of applications.
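The papers below each ship their own datasets and metrics, but the common pattern described above, scoring a model per task and attaching an uncertainty estimate to the result, can be illustrated with a minimal sketch. Everything in it (the `evaluate` and `bootstrap_ci` helpers, the toy tasks and model) is a hypothetical example, not an API from any of the listed benchmarks.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def evaluate(model, benchmark):
    """Score a model on each task; report mean accuracy plus a 95% CI.

    `model` is any callable mapping an input to a prediction; `benchmark`
    maps task names to lists of (input, expected_output) pairs.
    """
    report = {}
    for task_name, examples in benchmark.items():
        scores = [1.0 if model(x) == y else 0.0 for x, y in examples]
        low, high = bootstrap_ci(scores)
        report[task_name] = {"accuracy": mean(scores), "ci95": (low, high)}
    return report


if __name__ == "__main__":
    # Toy benchmark and model, purely illustrative.
    benchmark = {
        "arithmetic": [("2+2", "4"), ("3+5", "8"), ("7-2", "5")],
        "echo": [("hello", "hello"), ("world", "world")],
    }

    def toy_model(prompt):
        # A stand-in "model": answers arithmetic from a lookup, echoes otherwise.
        return {"2+2": "4", "3+5": "8", "7-2": "6"}.get(prompt, prompt)

    print(evaluate(toy_model, benchmark))
```

A real harness would add per-task metrics beyond exact-match accuracy, but the structure (a task-to-examples mapping, a scoring loop, and an uncertainty estimate per task) is the part these benchmarks standardize.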
Papers
Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark
Chang Xu, Jinwang Wang, Wen Yang, Huai Yu, Lei Yu, Gui-Song Xia
BAGEL: A Benchmark for Assessing Graph Neural Network Explanations
Mandeep Rathee, Thorben Funke, Avishek Anand, Megha Khosla
Learning Gait Representation from Massive Unlabelled Walking Videos: A Benchmark
Chao Fan, Saihui Hou, Jilong Wang, Yongzhen Huang, Shiqi Yu
BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing
Subhro Roy, Sam Thomson, Tongfei Chen, Richard Shin, Adam Pauls, Jason Eisner, Benjamin Van Durme
BOSS: A Benchmark for Human Belief Prediction in Object-context Scenarios
Jiafei Duan, Samson Yu, Nicholas Tan, Li Yi, Cheston Tan
Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery
Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, Yoshitaka Ushiku
CTooth: A Fully Annotated 3D Dataset and Benchmark for Tooth Volume Segmentation on Cone Beam Computed Tomography Images
Weiwei Cui, Yaqi Wang, Qianni Zhang, Huiyu Zhou, Dan Song, Xingyong Zuo, Gangyong Jia, Liaoyuan Zeng
A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks
Ganqu Cui, Lifan Yuan, Bingxiang He, Yangyi Chen, Zhiyuan Liu, Maosong Sun
Taxonomy of Benchmarks in Graph Representation Learning
Renming Liu, Semih Cantürk, Frederik Wenkel, Sarah McGuire, Xinyi Wang, Anna Little, Leslie O'Bray, Michael Perlmutter, Bastian Rieck, Matthew Hirn, Guy Wolf, Ladislav Rampášek
PolyU-BPCoMa: A Dataset and Benchmark Towards Mobile Colorized Mapping Using a Backpack Multisensorial System
Wenzhong Shi, Pengxin Chen, Muyang Wang, Sheng Bao, Haodong Xiang, Yue Yu, Daping Yang