Global Evaluation
Global evaluation across scientific domains focuses on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, evolving data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, applied to diverse model architectures including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems across applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
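To make the Item Response Theory reference above concrete, the sketch below shows how a simple Rasch (1PL) IRT model can jointly estimate a latent ability for each evaluated model and a latent difficulty for each benchmark item from a binary correctness matrix. This is a minimal illustrative example under assumed names (fit_rasch, responses) and a toy dataset; it is not drawn from any of the papers listed in this section.

# Minimal sketch of IRT-based benchmark scoring with a Rasch (1PL) model.
# Assumes a binary correctness matrix `responses` of shape (n_models, n_items);
# all names and data below are illustrative.
import numpy as np

def fit_rasch(responses, n_iters=500, lr=0.05):
    """Estimate model abilities and item difficulties by gradient ascent
    on the Rasch log-likelihood, P(correct) = sigmoid(ability - difficulty)."""
    n_models, n_items = responses.shape
    ability = np.zeros(n_models)      # latent skill of each model
    difficulty = np.zeros(n_items)    # latent difficulty of each item

    for _ in range(n_iters):
        logits = ability[:, None] - difficulty[None, :]
        probs = 1.0 / (1.0 + np.exp(-logits))
        residual = responses - probs  # gradient of the Bernoulli log-likelihood

        ability += lr * residual.sum(axis=1)
        difficulty -= lr * residual.sum(axis=0)
        difficulty -= difficulty.mean()  # center difficulties for identifiability

    return ability, difficulty

# Toy usage: 4 models answering 6 benchmark items (1 = correct, 0 = wrong).
responses = np.array([
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
], dtype=float)

ability, difficulty = fit_rasch(responses)
print("model abilities:", np.round(ability, 2))
print("item difficulties:", np.round(difficulty, 2))

Unlike raw accuracy, the fitted abilities weight items by how discriminative they are in practice, which is one reason IRT-style analyses appear in recent benchmark-construction work.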
Papers
A Song of (Dis)agreement: Evaluating the Evaluation of Explainable Artificial Intelligence in Natural Language Processing
Michael Neely, Stefan F. Schouten, Maurits Bleeker, Ana Lucic
The Construction and Evaluation of the LEAFTOP Dataset of Automatically Extracted Nouns in 1480 Languages
Greg Baker, Diego Molla-Aliod
Towards Practical Physics-Informed ML Design and Evaluation for Power Grid
Shimiao Li, Amritanshu Pandey, Larry Pileggi
Evaluation of a User Authentication Schema Using Behavioral Biometrics and Machine Learning
Laura Pryor, Jacob Mallet, Rushit Dave, Naeem Seliya, Mounika Vanamala, Evelyn Sowells Boone
GAM(e) changer or not? An evaluation of interpretable machine learning models based on additive model constraints
Patrick Zschech, Sven Weinzierl, Nico Hambauer, Sandra Zilker, Mathias Kraus
UID2021: An Underwater Image Dataset for Evaluation of No-reference Quality Assessment Metrics
Guojia Hou, Yuxuan Li, Huan Yang, Kunqian Li, Zhenkuan Pan
Training and Evaluation of Deep Policies using Reinforcement Learning and Generative Models
Ali Ghadirzadeh, Petra Poklukar, Karol Arndt, Chelsea Finn, Ville Kyrki, Danica Kragic, Mårten Björkman
NFT Appraisal Prediction: Utilizing Search Trends, Public Market Data, Linear Regression and Recurrent Neural Networks
Shrey Jain, Camille Bruckmann, Chase McDougall
Learning Performance Graphs from Demonstrations via Task-Based Evaluations
Aniruddh G. Puranic, Jyotirmoy V. Deshmukh, Stefanos Nikolaidis
EVOPS Benchmark: Evaluation of Plane Segmentation from RGBD and LiDAR Data
Anastasiia Kornilova, Dmitrii Iarosh, Denis Kukushkin, Nikolai Goncharov, Pavel Mokeev, Arthur Saliou, Gonzalo Ferrer