Global Evaluation
Global evaluation across scientific domains concerns robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, shifting data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, and that span diverse model architectures, including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
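To make the Item Response Theory angle concrete, below is a minimal sketch (not drawn from any of the listed papers) of the two-parameter logistic (2PL) IRT model applied to benchmark evaluation: each benchmark item gets a discrimination and a difficulty parameter, and a system's ability is estimated from its pattern of correct and incorrect responses rather than from raw accuracy. The function names, item parameters, and response pattern here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def irt_2pl(theta, a, b):
    """Two-parameter logistic IRT: probability that a system with
    ability `theta` solves an item with discrimination `a` and
    difficulty `b`."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b):
    """Maximum-likelihood estimate of a system's ability from its
    0/1 responses on items with known parameters (a, b)."""
    def neg_log_lik(theta):
        p = np.clip(irt_2pl(theta, a, b), 1e-9, 1 - 1e-9)  # numerical safety
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Toy benchmark: five items of increasing difficulty, equal discrimination.
a = np.ones(5)
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
responses = np.array([1, 1, 1, 0, 0])  # a system that solves only the easier items
print(f"Estimated ability: {estimate_ability(responses, a, b):.2f}")
```

Under this view, two systems with the same accuracy can receive different ability estimates if one solves harder, more discriminating items, which is one reason IRT appears in evaluation frameworks that go beyond simple accuracy.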
Papers
Resources and Evaluations for Multi-Distribution Dense Information Retrieval
Soumya Chatterjee, Omar Khattab, Simran Arora
On Evaluation of Document Classification using RVL-CDIP
Stefan Larson, Gordon Lim, Kevin Leach
Evaluation of Popular XAI Applied to Clinical Prediction Models: Can They be Trusted?
Aida Brankovic, David Cook, Jessica Rahman, Wenjie Huang, Sankalp Khanna
Evaluation of Chinese-English Machine Translation of Emotion-Loaded Microblog Texts: A Human Annotated Dataset for the Quality Assessment of Emotion Translation
Shenbin Qian, Constantin Orasan, Felix do Carmo, Qiuliang Li, Diptesh Kanojia
Deep Learning Methods for Retinal Blood Vessel Segmentation: Evaluation on Images with Retinopathy of Prematurity
Gorana Gojić, Veljko Petrović, Radovan Turović, Dinu Dragan, Ana Oros, Dušan Gajić, Nebojša Horvat
Comprehensive Training and Evaluation on Deep Reinforcement Learning for Automated Driving in Various Simulated Driving Maneuvers
Yongqi Dong, Tobias Datema, Vincent Wassenaar, Joris van de Weg, Cahit Tolga Kopar, Harim Suleman
GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization
Yang Janet Liu, Amir Zeldes
Good, but not always Fair: An Evaluation of Gender Bias for three commercial Machine Translation Systems
Silvia Alma Piazzolla, Beatrice Savoldi, Luisa Bentivogli
SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis Dataset and its Evaluation
Md. Ekramul Islam, Labib Chowdhury, Faisal Ahamed Khan, Shazzad Hossain, Sourave Hossain, Mohammad Mamun Or Rashid, Nabeel Mohammed, Mohammad Ruhul Amin
Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, Maosong Sun
Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers
Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang