Global Evaluation
Global evaluation across scientific domains concerns the development of robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, shifting data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, applied to a range of model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research across the scientific community.
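As a concrete illustration of the Item Response Theory approach mentioned above, the sketch below fits a minimal 1PL (Rasch) model to a binary model-by-item outcome matrix, jointly estimating a latent "ability" per model and a difficulty per benchmark item. This is an illustrative sketch, not code from any of the listed papers; the fit_rasch function, the toy outcome matrix, and the small ridge term for numerical stability are all assumptions made for this example.

```python
import numpy as np

def fit_rasch(responses, lr=0.05, n_iters=500, ridge=0.01):
    """Fit a 1PL (Rasch) IRT model to a binary outcome matrix.

    responses: (n_models, n_items) array, 1 = model answered item correctly.
    Returns (abilities, difficulties) from joint maximum likelihood,
    with a small ridge penalty on abilities to keep the fit stable.
    """
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)  # latent model "ability" scores
    b = np.zeros(n_items)       # latent item difficulties

    for _ in range(n_iters):
        # Predicted success probability for every (model, item) pair:
        # P(correct) = sigmoid(theta_m - b_i)
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = responses - p  # gradient of the Bernoulli log-likelihood
        theta += lr * (resid.sum(axis=1) - ridge * theta)  # ascend abilities
        b -= lr * resid.sum(axis=0)                        # ascend difficulties
        b -= b.mean()  # pin the scale so the model is identifiable

    return theta, b

# Toy benchmark: 4 models scored on 6 items (1 = correct answer).
rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=(4, 6))
abilities, difficulties = fit_rasch(outcomes)
print("model abilities:", np.round(abilities, 2))
print("item difficulties:", np.round(difficulties, 2))
```

Compared with raw accuracy, the fitted abilities account for which items each model got right, so two models with equal accuracy can be ranked differently if one solved systematically harder items.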
Papers
Evaluation of Code LLMs on Geospatial Code Generation
Piotr Gramacki, Bruno Martins, Piotr Szymański
An evaluation of LLM code generation capabilities through graded exercises
Álvaro Barbero Jiménez
DABI: Evaluation of Data Augmentation Methods Using Downsampling in Bilateral Control-Based Imitation Learning with Images
Masato Kobayashi, Thanpimon Buamanee, Yuki Uranishi
Evaluation of Security of ML-based Watermarking: Copy and Removal Attacks
Vitaliy Kinakh, Brian Pulfer, Yury Belousov, Pierre Fernandez, Teddy Furon, Slava Voloshynovskiy
Evaluation of Large Language Models for Summarization Tasks in the Medical Domain: A Narrative Review
Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Frank J. Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar
Design and Evaluation of a CDSS for Drug Allergy Management Using LLMs and Pharmaceutical Data Integration
Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis
Evaluation of state-of-the-art ASR Models in Child-Adult Interactions
Aditya Ashvin, Rimita Lahiri, Aditya Kommineni, Somer Bishop, Catherine Lord, Sudarsana Reddy Kadiri, Shrikanth Narayanan