Global Evaluation
Global evaluation research across scientific domains develops robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, evolving data distributions, and the need for human-centered metrics. Current work emphasizes comprehensive benchmarks and evaluation frameworks that go beyond simple accuracy, incorporating techniques such as Item Response Theory and multi-faceted metrics, and covering a range of model architectures, including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are important for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
Papers
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models
Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven
Evaluation of pretrained language models on music understanding
Yannis Vasilakis, Rachel Bittner, Johan Pauwels
Foundation Model or Finetune? Evaluation of few-shot semantic segmentation for river pollution
Marga Don, Stijn Pinson, Blanca Guillen Cebrian, Yuki M. Asano
Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li
Initial Development and Evaluation of the Creative Artificial Intelligence through Recurring Developments and Determinations (CAIRDD) System
Jeremy Straub, Zach Johnson
Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems
Sanjita Prajapati, Tanu Singh, Chinmay Hegde, Pranamesh Chakraborty