Global Evaluation
Global evaluation concerns building reliable methods for assessing the performance of models and systems across scientific domains, addressing challenges such as data diversity, shifting data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that go beyond simple accuracy, incorporating techniques such as Item Response Theory and multi-faceted metrics, and spanning architectures from Large Language Models (LLMs) to Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs). These advances are essential for establishing the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible, comparable research within the scientific community.
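To make the Item Response Theory reference above concrete, here is a minimal sketch of how an IRT-style evaluation can be applied to a benchmark: rather than ranking models by raw accuracy alone, a Rasch (1PL) model jointly estimates a latent ability per model and a latent difficulty per item. This sketch is illustrative only and is not taken from any paper listed below; it assumes NumPy, a binary response matrix R[m, i] (1 if model m answers item i correctly), and a hypothetical helper name fit_rasch.

import numpy as np

def fit_rasch(R, lr=0.01, n_steps=3000):
    """Joint maximum-likelihood fit of the 1PL (Rasch) model by gradient ascent.

    P(model m answers item i correctly) = sigmoid(ability[m] - difficulty[i]).
    """
    n_models, n_items = R.shape
    ability = np.zeros(n_models)     # latent skill of each model
    difficulty = np.zeros(n_items)   # latent difficulty of each item

    for _ in range(n_steps):
        # Predicted probability of a correct response for every (model, item) pair.
        logits = ability[:, None] - difficulty[None, :]
        p = 1.0 / (1.0 + np.exp(-logits))
        resid = R - p                          # gradient of the Bernoulli log-likelihood
        ability += lr * resid.sum(axis=1)      # d logL / d ability[m]
        difficulty -= lr * resid.sum(axis=0)   # d logL / d difficulty[i]
        # Gauge fix: the model is invariant to a common shift, so center both
        # parameter vectors by the same constant to keep the fit identifiable.
        c = difficulty.mean()
        difficulty -= c
        ability -= c
    return ability, difficulty

if __name__ == "__main__":
    # Simulated example: 5 hypothetical models evaluated on 200 hypothetical items.
    rng = np.random.default_rng(0)
    true_ability = rng.normal(size=5)
    true_difficulty = rng.normal(size=200)
    p_true = 1.0 / (1.0 + np.exp(-(true_ability[:, None] - true_difficulty[None, :])))
    R = (rng.random(p_true.shape) < p_true).astype(float)

    ability, difficulty = fit_rasch(R)
    print("estimated abilities:", np.round(ability, 2))
    print("accuracy per model: ", np.round(R.mean(axis=1), 2))

The estimated abilities usually track raw accuracy, but the fitted item difficulties additionally indicate which benchmark items actually discriminate between models, which is one motivation for moving beyond a single accuracy number.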
Papers
A Critical Evaluation of Evaluations for Long-form Question Answering
Fangyuan Xu, Yixiao Song, Mohit Iyyer, Eunsol Choi
Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition
Xiaoliang Wu, Peter Bell, Ajitha Rajan
SEIP: Simulation-based Design and Evaluation of Infrastructure-based Collective Perception
Ao Qu, Xuhuan Huang, Dajiang Suo
Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation
Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, Yanai Elazar
Evaluation of Question Generation Needs More References
Shinhyeok Oh, Hyojun Go, Hyeongdon Moon, Yunsung Lee, Myeongho Jeong, Hyun Seung Lee, Seungtaek Choi
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Niels Mündler, Jingxuan He, Slobodan Jenko, Martin Vechev
An Experimental Investigation into the Evaluation of Explainability Methods
Sédrick Stassin, Alexandre Englebert, Géraldin Nanfack, Julien Albert, Nassim Versbraegen, Gilles Peiffer, Miriam Doh, Nicolas Riche, Benoît Frenay, Christophe De Vleeschouwer
Visual Programming for Text-to-Image Generation and Evaluation
Jaemin Cho, Abhay Zala, Mohit Bansal
PLCMOS -- a data-driven non-intrusive metric for the evaluation of packet loss concealment algorithms
Lorenz Diener, Marju Purin, Sten Sootla, Ando Saabas, Robert Aichner, Ross Cutler
LoopBoxes -- Evaluation of a Collaborative Accessible Digital Musical Instrument
Andreas Förster, Alarith Uhde, Mathias Komesker, Christina Komesker, Irina Schmidt
Don't Take This Out of Context! On the Need for Contextual Models and Evaluations for Stylistic Rewriting
Akhila Yerukola, Xuhui Zhou, Elizabeth Clark, Maarten Sap
Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response
Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze
Scientific Opinion Summarization: Paper Meta-review Generation Dataset, Methods, and Evaluation
Qi Zeng, Mankeerat Sidhu, Ansel Blume, Hou Pong Chan, Lu Wang, Heng Ji
Evaluation of African American Language Bias in Natural Language Generation
Nicholas Deas, Jessi Grieser, Shana Kleiner, Desmond Patton, Elsbeth Turcan, Kathleen McKeown
LLM-empowered Chatbots for Psychiatrist and Patient Simulation: Application and Evaluation
Siyuan Chen, Mengyue Wu, Kenny Q. Zhu, Kunyao Lan, Zhiling Zhang, Lyuchun Cui
Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method
Yiming Wang, Zhuosheng Zhang, Rui Wang
A study of conceptual language similarity: comparison and evaluation
Haotian Ye, Yihong Liu, Hinrich Schütze
Efficient Large-Scale Visual Representation Learning And Evaluation
Eden Dolev, Alaa Awad, Denisa Roberts, Zahra Ebrahimzadeh, Marcin Mejran, Vaibhav Malpani, Mahir Yavuz
Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models
Xiaolei Wang, Xinyu Tang, Wayne Xin Zhao, Jingyuan Wang, Ji-Rong Wen