LLM-Based Evaluation
LLM-based evaluation uses large language models (LLMs) to assess the outputs of other LLMs, automating a traditionally labor-intensive process. Current research emphasizes improving the reliability and interpretability of these evaluations, exploring techniques such as checklist generation, prompt engineering, and the combination of multiple LLM evaluators to achieve higher agreement with human judgments across diverse tasks and languages. Reliable automatic evaluation is central to LLM development and deployment: it enables more objective comparisons of model capabilities and helps surface biases or weaknesses in existing models, ultimately supporting more robust and beneficial AI systems.
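To make the "LLM-as-a-judge with multiple evaluators" idea concrete, the sketch below shows one minimal way such a pipeline can be wired up. It is an illustrative assumption, not code from any of the listed papers: the RUBRIC prompt, the parse_score and score_with_judges helpers, and the stub judge functions are hypothetical, and in practice each judge would wrap a real LLM API call.

```python
"""Minimal sketch of LLM-as-a-judge evaluation with multiple judge models.

Assumption: a "judge" is any callable that maps an evaluation prompt to the
model's raw text reply (e.g., a chat-completion request to your provider).
"""

import re
import statistics
from typing import Callable, List

# A judge maps an evaluation prompt to the judge model's raw text reply.
Judge = Callable[[str], str]

# Hypothetical scoring rubric embedded in the prompt sent to each judge.
RUBRIC = (
    "Rate the RESPONSE to the INSTRUCTION for factual accuracy and fluency "
    "on a 1-5 scale. Reply with a single integer.\n\n"
    "INSTRUCTION: {instruction}\n\nRESPONSE: {response}\n\nScore:"
)


def parse_score(reply: str) -> int:
    """Extract the first integer 1-5 from the judge's reply; default to 3."""
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 3


def score_with_judges(instruction: str, response: str, judges: List[Judge]) -> float:
    """Ask every judge LLM to score the response and aggregate by the mean."""
    prompt = RUBRIC.format(instruction=instruction, response=response)
    scores = [parse_score(judge(prompt)) for judge in judges]
    return statistics.mean(scores)


if __name__ == "__main__":
    # Stub judges standing in for real models; swap in actual API calls.
    lenient_judge: Judge = lambda _prompt: "Score: 5"
    strict_judge: Judge = lambda _prompt: "4"

    score = score_with_judges(
        instruction="Summarize the paper in one sentence.",
        response="The paper proposes using LLM ensembles as evaluators.",
        judges=[lenient_judge, strict_judge],
    )
    print(f"Aggregated judge score: {score:.1f}")  # 4.5
```

Averaging several judges (here a simple mean) is one straightforward way to damp the idiosyncrasies of any single evaluator before the aggregated scores are compared against human ratings.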
Papers
Are LLM-based Evaluators Confusing NLG Quality Criteria?
Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan
Understanding the Therapeutic Relationship between Counselors and Clients in Online Text-based Counseling using LLMs
Anqi Li, Yu Lu, Nirui Song, Shuai Zhang, Lizhi Ma, Zhenzhong Lan
Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4
Ting Fang Tan, Kabilan Elangovan, Liyuan Jin, Yao Jie, Li Yong, Joshua Lim, Stanley Poh, Wei Yan Ng, Daniel Lim, Yuhe Ke, Nan Liu, Daniel Shu Wei Ting
Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
Timothy R. McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Paul Watters, Malka N. Halgamuge