Global Evaluation
Global evaluation across scientific domains focuses on developing robust, reliable methods for assessing the performance of models and systems, addressing challenges such as data diversity, evolving data distributions, and the need for human-centered metrics. Current research emphasizes comprehensive benchmarks and evaluation frameworks that incorporate techniques such as Item Response Theory and multi-faceted metrics beyond simple accuracy, and that span diverse model architectures, including Large Language Models (LLMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These advances are crucial for ensuring the trustworthiness and effectiveness of AI systems in applications ranging from medical diagnosis to autonomous driving, and for fostering reproducible and comparable research within the scientific community.
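As a concrete illustration of one technique mentioned above, the sketch below shows how Item Response Theory (here, the two-parameter logistic model) can be fit to a binary model-versus-benchmark-item response matrix, recovering a difficulty and discrimination for each item alongside an ability for each model. This is a minimal sketch under stated assumptions, not the method of any paper listed below; the function name, hyperparameters, and fitting procedure (plain gradient ascent on the log-likelihood) are all illustrative choices.

```python
# Minimal 2PL IRT sketch for benchmark evaluation (illustrative, not from
# any listed paper): P(model m answers item i correctly)
#   = sigmoid(a_i * (theta_m - b_i)),
# where theta_m is model ability, b_i item difficulty, a_i discrimination.
import numpy as np

def fit_2pl(responses, n_iters=2000, lr=0.05, seed=0):
    """responses: (n_models, n_items) binary matrix of correct/incorrect."""
    rng = np.random.default_rng(seed)
    n_models, n_items = responses.shape
    theta = rng.normal(0, 0.1, n_models)   # model abilities
    a = np.ones(n_items)                   # item discriminations
    b = rng.normal(0, 0.1, n_items)        # item difficulties
    for _ in range(n_iters):
        z = a[None, :] * (theta[:, None] - b[None, :])
        p = 1.0 / (1.0 + np.exp(-z))       # predicted P(correct)
        err = responses - p                # dL/dz of Bernoulli log-likelihood
        # Gradient ascent via the chain rule: dz/dtheta = a,
        # dz/da = theta - b, dz/db = -a.
        theta += lr * (err * a[None, :]).sum(axis=1) / n_items
        a += lr * (err * (theta[:, None] - b[None, :])).sum(axis=0) / n_models
        b += lr * (-err * a[None, :]).sum(axis=0) / n_models
        theta -= theta.mean()              # pin the latent scale's location
    return theta, a, b

# Toy usage: 5 models evaluated on 20 benchmark items.
rng = np.random.default_rng(1)
responses = (rng.random((5, 20)) < 0.6).astype(float)
theta, a, b = fit_2pl(responses)
print("model abilities:", np.round(theta, 2))
print("hardest item index:", int(np.argmax(b)))
```

One appeal of this framing for evaluation is that accuracy alone treats all items equally, whereas the fitted difficulties and discriminations let a benchmark weight items by how informative they are about model ability.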
Papers
VERA: Validation and Evaluation of Retrieval-Augmented Systems
Tianyu Ding, Adi Banerjee, Laurent Mombaerts, Yunhong Li, Tarik Borogovac, Juan Pablo De la Cruz Weinstein
DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts
Xiongtao Sun, Gan Liu, Zhipeng He, Hui Li, Xiaoguang Li
What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain
Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng, Thomas Kober
Latin Treebanks in Review: An Evaluation of Morphological Tagging Across Time
Marisa Hudspeth, Brendan O'Connor, Laure Thompson
Towards Using Multiple Iterated, Reproduced, and Replicated Experiments with Robots (MIRRER) for Evaluation and Benchmarking
Adam Norton, Brian Flynn
Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation
Nicy Scaria, Suma Dharani Chenna, Deepak Subramani
CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models
Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, Joshua Saxe
The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines
Matias Martinez