New Benchmark
Recent research has focused on developing comprehensive benchmarks for evaluating large language models (LLMs) and other machine learning models across diverse tasks, including economic games, financial question answering, graph analysis, and robotic manipulation. These benchmarks aim to standardize evaluation methodology, address fairness and robustness, and quantify uncertainty in model performance across architectures such as transformers and graph neural networks. The resulting standardized evaluations and datasets enable more rigorous model comparisons and expose areas needing improvement, supporting more reliable and effective AI systems across a wide range of applications.
Papers
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stephan Clemencon, Pierre Colombo
Motion-Scenario Decoupling for Rat-Aware Video Position Prediction: Strategy and Benchmark
Xiaofeng Liu, Jiaxin Gao, Yaohua Liu, Risheng Liu, Nenggan Zheng
Continual Facial Expression Recognition: A Benchmark
Nikhil Churamani, Tolga Dimlioglu, German I. Parisi, Hatice Gunes
Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations
Qingyu Chen, Jingcheng Du, Yan Hu, Vipina Kuttichi Keloth, Xueqing Peng, Kalpana Raja, Rui Zhang, Zhiyong Lu, Hua Xu
Compressed Video Quality Assessment for Super-Resolution: a Benchmark and a Quality Metric
Evgeney Bogatyrev, Ivan Molodetskikh, Dmitriy Vatolin
Facilitating Fine-grained Detection of Chinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks
Junyu Lu, Bo Xu, Xiaokun Zhang, Changrong Min, Liang Yang, Hongfei Lin
SocialDial: A Benchmark for Socially-Aware Dialogue Systems
Haolan Zhan, Zhuang Li, Yufei Wang, Linhao Luo, Tao Feng, Xiaoxi Kang, Yuncheng Hua, Lizhen Qu, Lay-Ki Soon, Suraj Sharma, Ingrid Zukerman, Zhaleh Semnani-Azad, Gholamreza Haffari
A Benchmark for Cycling Close Pass Near Miss Event Detection from Video Streams
Mingjie Li, Tharindu Rathnayake, Ben Beck, Lingheng Meng, Zijue Chen, Akansel Cosgun, Xiaojun Chang, Dana Kulić