Global Evaluation
Global evaluation across scientific domains focuses on robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, shifting data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, and that span diverse model architectures, including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
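To make the Item Response Theory angle concrete, the following is a minimal sketch (not drawn from any of the papers below) of how a two-parameter logistic (2PL) IRT model can score benchmarked systems beyond raw accuracy; it assumes a Python environment with NumPy, and the item parameters, system names, and the `estimate_ability` helper are purely illustrative.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item-response model: probability that a system with latent
    ability `theta` answers an item with discrimination `a` and
    difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, grid=np.linspace(-4, 4, 801)):
    """Grid-search maximum-likelihood estimate of a system's latent
    ability, given a 0/1 response vector and known item parameters."""
    probs = p_correct(grid[:, None], a[None, :], b[None, :])
    loglik = (responses * np.log(probs)
              + (1 - responses) * np.log(1 - probs)).sum(axis=1)
    return grid[np.argmax(loglik)]

# Hypothetical benchmark: six items of increasing difficulty, two systems.
a = np.array([1.0, 1.2, 0.8, 1.5, 1.0, 0.9])    # discrimination
b = np.array([-2.0, -1.0, 0.0, 0.5, 1.5, 2.5])  # difficulty
system_x = np.array([1, 1, 1, 1, 0, 0])  # misses only the hardest items
system_y = np.array([1, 1, 0, 1, 1, 0])  # same accuracy, different items

for name, resp in [("system_x", system_x), ("system_y", system_y)]:
    print(f"{name}: accuracy={resp.mean():.2f}, "
          f"IRT ability={estimate_ability(resp, a, b):.2f}")
```

The point of the sketch is that two systems with identical accuracy can receive different ability estimates, because an IRT model weights which items were answered correctly rather than simply how many.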
Papers
Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond
Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe
Edge Computing-Enabled Road Condition Monitoring: System Development and Evaluation
Abdulateef Daud, Mark Amo-Boateng, Neema Jakisa Owor, Armstrong Aboah, Yaw Adu-Gyamfi
EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations
Vaibhav Bihani, Utkarsh Pratiush, Sajid Mannan, Tao Du, Zhimin Chen, Santiago Miret, Matthieu Micoulaut, Morten M Smedskjaer, Sayan Ranu, N M Anoop Krishnan
Jury: A Comprehensive Evaluation Toolkit
Devrim Cavusoglu, Secil Sen, Ulas Sert, Sinan Altinuc
An evaluation of pre-trained models for feature extraction in image classification
Erick da Silva Puls, Matheus V. Todescato, Joel L. Carbonera
Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
Elizabeth Seger, Noemi Dreksler, Richard Moulange, Emily Dardaman, Jonas Schuett, K. Wei, Christoph Winter, Mackenzie Arnold, Seán Ó hÉigeartaigh, Anton Korinek, Markus Anderljung, Ben Bucknall, Alan Chan, Eoghan Stafford, Leonie Koessler, Aviv Ovadya, Ben Garfinkel, Emma Bluemke, Michael Aird, Patrick Levermore, Julian Hazell, Abhishek Gupta
An evaluation of GPT models for phenotype concept recognition
Tudor Groza, Harry Caufield, Dylan Gration, Gareth Baynam, Melissa A Haendel, Peter N Robinson, Christopher J Mungall, Justin T Reese
Design and Evaluation of Motion Planners for Quadrotors in Environments with Varying Complexities
Yifei Simon Shao, Yuwei Wu, Laura Jarin-Lipschitz, Pratik Chaudhari, Vijay Kumar
Skill Check: Some Considerations on the Evaluation of Gamemastering Models for Role-playing Games
Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás