G-Eval
G-Eval and related evaluation frameworks address the need for robust, reliable methods of assessing the performance of large language models (LLMs). Current research focuses on comprehensive benchmarks that evaluate LLMs across diverse tasks and domains, including safety, mathematical reasoning, and multilingual capability, often using LLMs themselves as evaluators or decomposing evaluation criteria hierarchically. These advances support better LLM development, fairer comparisons between models, and the responsible deployment of these technologies across applications.
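To make the LLM-as-a-judge idea concrete, the sketch below shows one minimal way such an evaluator can be structured: a rubric-style prompt asks a judge model to rate an output against a single criterion, and the numeric score is parsed from the reply. The `judge` callable, criterion text, and prompt wording are illustrative assumptions, not the prompts used by G-Eval or the papers listed below.

```python
import re
from typing import Callable

def score_with_llm_judge(
    judge: Callable[[str], str],   # assumption: takes a prompt string, returns the judge model's text reply
    criterion: str,
    source: str,
    candidate: str,
    scale: tuple[int, int] = (1, 5),
) -> int | None:
    """Ask a judge LLM to rate `candidate` against `criterion`; return the parsed score."""
    lo, hi = scale
    prompt = (
        "You are evaluating a model output.\n"
        f"Criterion: {criterion}\n"
        f"Input:\n{source}\n\n"
        f"Output to evaluate:\n{candidate}\n\n"
        f"Rate the output on this criterion from {lo} (worst) to {hi} (best). "
        "Reply with the number only."
    )
    reply = judge(prompt)
    match = re.search(r"\d+", reply)      # pull the first integer out of the reply
    if match is None:
        return None                       # judge did not return a usable score
    score = int(match.group())
    return max(lo, min(hi, score))        # clamp to the rating scale

# Example with a stand-in judge (a real setup would call an LLM API here).
if __name__ == "__main__":
    fake_judge = lambda prompt: "4"
    print(score_with_llm_judge(fake_judge, "Coherence of the summary",
                               "Article text...", "Candidate summary..."))
```

Frameworks such as G-Eval extend this basic score-and-parse loop with chain-of-thought evaluation steps and probability-weighted scoring; the sketch captures only the core pattern.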
Papers
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models
Wenhong Zhu, Hongkun Hao, Zhiwei He, Yunze Song, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, Hongyuan Lu
X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects
Minqian Liu, Ying Shen, Zhiyang Xu, Yixin Cao, Eunah Cho, Vaibhav Kumar, Reza Ghanadan, Lifu Huang