Novel Benchmark
Novel benchmarks are being developed to rigorously evaluate large language models (LLMs) and other AI models across diverse tasks, addressing limitations of existing evaluation methods. Current research focuses on benchmarks that assess capabilities such as code generation, multimodal reasoning, and handling of complex real-world scenarios, often drawing on diverse data sources and measuring robustness to factors such as language variation and distribution shift. These improved benchmarks are crucial for advancing the field because they provide more accurate and comprehensive assessments of model performance, supporting the development of more reliable and effective AI systems across applications.
Papers
A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models
Elena Kardanova, Alina Ivanova, Ksenia Tarasova, Taras Pashchenko, Aleksei Tikhoniuk, Elen Yusupova, Anatoly Kasprzhak, Yaroslav Kuzminov, Ekaterina Kruchinskaia, Irina Brun (National Research University Higher School of Economics, Moscow, Russia)
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
Yutao Mou, Shikun Zhang, Wei Ye