LLM-Based Evaluation
LLM-based evaluation uses large language models (LLMs) to assess the performance of other LLMs, automating a traditionally labor-intensive process. Current research aims to make these evaluations more reliable and interpretable, exploring techniques such as checklist generation, prompt engineering, and the combination of multiple LLM evaluators to achieve higher agreement with human judgments across diverse tasks and languages. The field matters for LLM development and deployment because it enables more objective comparisons of model capabilities and helps identify biases or weaknesses in existing models, which in turn supports more robust and beneficial AI systems.
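To make the idea of combining multiple LLM evaluators and checking their agreement with human judgments concrete, here is a minimal sketch. It assumes each evaluator returns a categorical verdict ("pass" or "fail") per item, combines the verdicts by majority vote, and measures agreement with Cohen's kappa. The function names, the label set, and the toy verdicts are all hypothetical and for illustration only. They do not come from any of the papers listed here, which may use different aggregation schemes and agreement metrics.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict; ties go to the verdict seen first."""
    return Counter(verdicts).most_common(1)[0][0]


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy verdicts from three LLM evaluators and one human annotator.
judge_a = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
judge_c = ["pass", "pass", "fail", "fail", "fail", "fail"]
human = ["pass", "pass", "fail", "pass", "fail", "fail"]

# Combine the three evaluators item by item.
ensemble = [majority_vote(votes) for votes in zip(judge_a, judge_b, judge_c)]

print("single evaluator vs human:", cohen_kappa(judge_b, human))
print("ensemble vs human:", cohen_kappa(ensemble, human))
```

In this toy data the ensemble agrees with the human labels more closely than `judge_b` does alone, but that is a property of the made-up verdicts, not an empirical result. Majority voting is only one possible aggregation rule; weighted voting or averaging numeric scores are common alternatives.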
Papers
A Course Shared Task on Evaluating LLM Output for Clinical Questions
Yufang Hou, Thy Thy Tran, Doan Nam Long Vu, Yiwen Cao, Kai Li, Lukas Rohde, Iryna Gurevych
KemenkeuGPT: Leveraging a Large Language Model on Indonesia's Government Financial Data and Regulations to Enhance Decision Making
Gilang Fajar Febrian, Grazziela Figueredo