LLM Benchmark
LLM benchmarking aims to objectively evaluate the capabilities of large language models across diverse tasks, addressing limitations of existing approaches such as reliance on static datasets and potential biases in human or LLM-based judging. Current research focuses on developing more robust and dynamic benchmarks, including those built on real-world interactions, game-based competitions, and knowledge-grounded evaluations, often incorporating techniques such as prompt engineering and multi-agent coordination. These efforts are crucial for fostering the responsible development and deployment of LLMs, improving model transparency, and guiding future research directions in AI.
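To make the basic evaluation setup concrete, below is a minimal sketch of a static-dataset benchmark loop in the spirit of factuality evaluations: a model is asked short questions and scored by exact match against gold answers. The `ask_model` stub, the tiny `DATASET`, and the exact-match metric are illustrative assumptions, not the protocol of any specific benchmark listed here.

```python
# Minimal sketch of a static-dataset benchmark loop: ask a model short factual
# questions and score answers by exact match. The ask_model stub and the tiny
# question set are illustrative placeholders, not any published benchmark.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    question: str
    answer: str  # gold reference answer


# Hypothetical dataset; real benchmarks use thousands of curated items.
DATASET: List[Item] = [
    Item("What is the capital of France?", "Paris"),
    Item("How many planets are in the Solar System?", "8"),
]


def ask_model(prompt: str) -> str:
    """Placeholder for a call to an LLM API (an assumption, not a real client)."""
    canned = {"What is the capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")


def evaluate(model: Callable[[str], str], dataset: List[Item]) -> float:
    """Return exact-match accuracy of the model over the dataset."""
    correct = 0
    for item in dataset:
        prediction = model(item.question).strip().lower()
        correct += prediction == item.answer.strip().lower()
    return correct / len(dataset)


if __name__ == "__main__":
    print(f"Exact-match accuracy: {evaluate(ask_model, DATASET):.2%}")
```

In practice, the dynamic benchmarks discussed above replace the fixed `DATASET` and exact-match scoring with live interactions, game outcomes, or knowledge-grounded judging, but the overall loop of prompting a model and scoring its outputs remains the same.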
Papers
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Hui Huang, Weixun Wang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Xuepeng Liu, Dekai Sun, Wenbo Su, Bo Zheng
Benchmarking LLMs' Judgments with No Gold Standard
Shengwei Xu, Yuxuan Lu, Grant Schoenebeck, Yuqing Kong