Feature Attribution
Feature attribution aims to explain the predictions of complex machine learning models by identifying which input features most significantly influence the output. Current research focuses on developing and evaluating various attribution methods, including gradient-based approaches like Integrated Gradients and game-theoretic methods like SHAP, often applied to deep neural networks (including transformers) and other architectures like Siamese encoders. These efforts address challenges such as faithfulness (accuracy of attributions), robustness (consistency under perturbations), and computational efficiency, ultimately seeking to improve model transparency and trustworthiness for applications ranging from medical diagnosis to scientific discovery.
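To make the gradient-based approach mentioned above concrete, the sketch below shows a minimal Integrated Gradients implementation. It is illustrative only: the `model`, input, baseline, and step count are placeholder assumptions, and a production implementation (e.g., Captum's `IntegratedGradients`) would add batching, target selection, and convergence checks.

```python
# Minimal sketch of Integrated Gradients for a differentiable model.
# Assumes a PyTorch model mapping a batch of feature vectors to scalar
# scores; the model, input, and baseline below are illustrative placeholders.
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Approximate Integrated Gradients attributions for input x."""
    if baseline is None:
        baseline = torch.zeros_like(x)          # common choice: all-zero baseline
    # Interpolate between the baseline and the input along a straight path.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    # Gradients of the model output with respect to each interpolated point.
    outputs = model(path).sum()
    grads = torch.autograd.grad(outputs, path)[0]
    # Average the gradients along the path and scale by the input difference.
    avg_grads = grads.mean(dim=0)
    return (x - baseline) * avg_grads

# Example usage on a tiny linear model (for illustration only).
model = torch.nn.Linear(4, 1)
x = torch.randn(4)
attributions = integrated_gradients(model, x)   # one attribution per feature
```

The attributions sum (approximately) to the difference between the model's output at the input and at the baseline, which is the completeness property that distinguishes Integrated Gradients from raw gradient saliency.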
Papers
Argument Attribution Explanations in Quantitative Bipolar Argumentation Frameworks (Technical Report)
Xiang Yin, Nico Potyka, Francesca Toni
Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions
Skyler Wu, Eric Meng Shen, Charumathi Badrinath, Jiaqi Ma, Himabindu Lakkaraju