Global Evaluation
Global evaluation across scientific domains focuses on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, evolving data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, applied to diverse model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs); a minimal sketch of the IRT idea is given below. These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems across applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
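To make the Item Response Theory reference concrete, here is a minimal sketch of a 2-parameter logistic (2PL) IRT curve, the kind of model some evaluation frameworks use to weight benchmark items by difficulty and discrimination instead of reporting raw accuracy. The function and parameter names are illustrative assumptions, not taken from any of the papers listed below.

```python
# Illustrative sketch: 2-parameter logistic (2PL) IRT item characteristic curve.
# All names and parameter values here are hypothetical examples.
import numpy as np

def irt_2pl(theta, difficulty, discrimination):
    """Probability that a system with latent ability `theta` solves an item
    with the given difficulty and discrimination parameters."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

# Example: two benchmark items, one easy and one hard, scored for a
# hypothetical model whose estimated ability is theta = 0.5.
theta = 0.5
items = [
    {"name": "easy_item", "difficulty": -1.0, "discrimination": 1.2},
    {"name": "hard_item", "difficulty": 2.0, "discrimination": 0.8},
]
for item in items:
    p = irt_2pl(theta, item["difficulty"], item["discrimination"])
    print(f'{item["name"]}: P(correct) = {p:.2f}')
```

Under this kind of model, two systems with the same raw accuracy can receive different ability estimates depending on which items they solve, which is one way benchmarks move beyond simple accuracy.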
Papers
EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations
Vaibhav Bihani, Utkarsh Pratiush, Sajid Mannan, Tao Du, Zhimin Chen, Santiago Miret, Matthieu Micoulaut, Morten M Smedskjaer, Sayan Ranu, N M Anoop Krishnan
Jury: A Comprehensive Evaluation Toolkit
Devrim Cavusoglu, Secil Sen, Ulas Sert, Sinan Altinuc
An evaluation of pre-trained models for feature extraction in image classification
Erick da Silva Puls, Matheus V. Todescato, Joel L. Carbonera
Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
Elizabeth Seger, Noemi Dreksler, Richard Moulange, Emily Dardaman, Jonas Schuett, K. Wei, Christoph Winter, Mackenzie Arnold, Seán Ó hÉigeartaigh, Anton Korinek, Markus Anderljung, Ben Bucknall, Alan Chan, Eoghan Stafford, Leonie Koessler, Aviv Ovadya, Ben Garfinkel, Emma Bluemke, Michael Aird, Patrick Levermore, Julian Hazell, Abhishek Gupta
An evaluation of GPT models for phenotype concept recognition
Tudor Groza, Harry Caufield, Dylan Gration, Gareth Baynam, Melissa A Haendel, Peter N Robinson, Christopher J Mungall, Justin T Reese
Design and Evaluation of Motion Planners for Quadrotors in Environments with Varying Complexities
Yifei Simon Shao, Yuwei Wu, Laura Jarin-Lipschitz, Pratik Chaudhari, Vijay Kumar
Skill Check: Some Considerations on the Evaluation of Gamemastering Models for Role-playing Games
Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás
Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models
Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, Jie Yang
Evaluation of large language models for discovery of gene set function
Mengzhou Hu, Sahar Alkhairy, Ingoo Lee, Rudolf T. Pillich, Dylan Fong, Kevin Smith, Robin Bachelder, Trey Ideker, Dexter Pratt
Evaluating Deep Learning-based Melanoma Classification using Immunohistochemistry and Routine Histology: A Three Center Study
Christoph Wies, Lucas Schneider, Sarah Haggenmueller, Tabea-Clara Bucher, Sarah Hobelsberger, Markus V. Heppt, Gerardo Ferrara, Eva I. Krieghoff-Henning, Titus J. Brinker