New Benchmark
Recent research focuses on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodologies, address issues such as fairness and robustness, and quantify uncertainty in model performance across architectures ranging from transformers to graph neural networks. The resulting standardized evaluations and datasets enable more rigorous model comparisons, highlight areas needing improvement, and ultimately support more reliable and effective AI systems across a wide range of applications.
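As a minimal, illustrative sketch (not taken from any of the papers listed below), the snippet shows one common way benchmarks "quantify uncertainty in model performance": scoring a model on a held-out evaluation set and attaching a bootstrap confidence interval to its accuracy. The names predict_fn and examples are hypothetical placeholders, not APIs from any specific benchmark.

import random
from typing import Callable, Sequence, Tuple

def benchmark_accuracy(
    predict_fn: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],  # (input, reference answer) pairs
    n_bootstrap: int = 1000,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float]]:
    """Return point-estimate accuracy and an approximate 95% bootstrap CI."""
    # Score each example as 1 (correct) or 0 (incorrect) against the reference.
    correct = [int(predict_fn(x) == y) for x, y in examples]
    acc = sum(correct) / len(correct)

    # Resample the per-example scores with replacement to estimate variability.
    rng = random.Random(seed)
    resampled = []
    for _ in range(n_bootstrap):
        sample = [rng.choice(correct) for _ in correct]
        resampled.append(sum(sample) / len(sample))
    resampled.sort()
    lo = resampled[int(0.025 * n_bootstrap)]
    hi = resampled[int(0.975 * n_bootstrap)]
    return acc, (lo, hi)

Reporting an interval alongside the point estimate makes comparisons between models on the same benchmark less sensitive to evaluation-set size and sampling noise.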
Papers
GTA: A Benchmark for General Tool Agents
Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, Xinyi Le
eyeballvul: a future-proof benchmark for vulnerability detection in the wild
Timothee Chauvin
AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models
Xiang Lisa Li, Evan Zheran Liu, Percy Liang, Tatsunori Hashimoto
WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving
Jannik Zürn, Paul Gladkov, Sofía Dudas, Fergal Cotter, Sofi Toteva, Jamie Shotton, Vasiliki Simaiaki, Nikhil Mohan
RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios
Liming Zheng, Feng Yan, Fanfan Liu, Chengjian Feng, Zhuoliang Kang, Lin Ma
NoisyAG-News: A Benchmark for Addressing Instance-Dependent Noise in Text Classification
Hongfei Huang, Tingting Liang, Xixi Sun, Zikang Jin, Yuyu Yin
KidSat: satellite imagery to map childhood poverty dataset and benchmark
Makkunda Sharma, Fan Yang, Duy-Nhat Vo, Esra Suel, Swapnil Mishra, Samir Bhatt, Oliver Fiala, William Rudgard, Seth Flaxman
TAPVid-3D: A Benchmark for Tracking Any Point in 3D
Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, Carl Doersch
A Benchmark for Multi-speaker Anonymization
Xiaoxiao Miao, Ruijie Tao, Chang Zeng, Xin Wang
Towards Reflected Object Detection: A Benchmark
Zhongtian Wang, You Wu, Hui Zhou, Shuiwang Li
LTLBench: Towards Benchmarks for Evaluating Temporal Logic Reasoning in Large Language Models
Weizhi Tang, Vaishak Belle
iSign: A Benchmark for Indian Sign Language Processing
Abhinav Joshi, Romit Mohanty, Mounika Kanakanti, Andesha Mangla, Sudeep Choudhary, Monali Barbate, Ashutosh Modi
IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning
Abhinav Joshi, Shounak Paul, Akshat Sharma, Pawan Goyal, Saptarshi Ghosh, Ashutosh Modi
CLIMB: A Benchmark of Clinical Bias in Large Language Models
Yubo Zhang, Shudi Hou, Mingyu Derek Ma, Wei Wang, Muhao Chen, Jieyu Zhao
Tracking Reflected Objects: A Benchmark
Xiaoyu Guo, Pengzhi Zhong, Lizhi Lin, Hao Zhang, Ling Huang, Shuiwang Li