Human Evaluation
Human evaluation in artificial intelligence, particularly for large language models (LLMs), focuses on developing reliable and efficient methods for assessing model outputs against human judgment. Current research emphasizes standardized evaluation frameworks, often incorporating LLM-as-a-judge approaches to automate assessment at scale while addressing the biases and inconsistencies that affect both human raters and automated judges. Such work is essential for responsible AI development and deployment: it underpins the trustworthiness and practical applicability of LLMs across domains from medical diagnosis to scientific synthesis by ensuring that model behavior aligns with human needs and values.
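To make the LLM-as-a-judge idea concrete, here is a minimal Python sketch of a pairwise judge that mitigates one well-known automated-assessment bias, position bias, by querying twice with the answer order swapped and accepting only verdicts that are consistent under both orderings. The prompt template, the `call_judge` placeholder, and the tie-on-disagreement rule are illustrative assumptions, not a method from any of the papers listed below; substitute a real LLM API call for `call_judge`.

```python
# Minimal LLM-as-a-judge sketch with position-bias mitigation.
# NOTE: `call_judge` is a hypothetical placeholder, not a real API;
# replace it with a call to your LLM provider's client.

JUDGE_PROMPT = """You are an impartial evaluator. Given a question and two
candidate answers, reply with exactly "A", "B", or "TIE".

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Verdict:"""


def call_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; always abstains here."""
    return "TIE"


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with the answer order swapped to control position bias.

    Returns "1", "2", or "TIE", referring to the original answer labels.
    """
    first = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_1, answer_b=answer_2)).strip()
    second = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_2, answer_b=answer_1)).strip()

    # Map the swapped run's verdict back onto the first run's labels.
    swapped = {"A": "B", "B": "A", "TIE": "TIE"}.get(second, "TIE")
    if first == swapped:  # verdict is consistent under both orderings
        return {"A": "1", "B": "2", "TIE": "TIE"}.get(first, "TIE")
    return "TIE"  # inconsistent verdicts are conservatively scored as a tie


if __name__ == "__main__":
    print(judge_pair("What is 2 + 2?", "4", "five"))
```

The order-swapping step matters because LLM judges have been observed to favor whichever answer appears first; counting inconsistent verdicts as ties trades some decisiveness for reliability.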
Papers
ProtSi: Prototypical Siamese Network with Data Augmentation for Few-Shot Subjective Answer Evaluation
Yining Lu, Jingxi Qiu, Gaurav Gupta
Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation
Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Anya Belz, Ehud Reiter