Feature Attribution
Feature attribution aims to explain the predictions of complex machine learning models by identifying which input features most significantly influence the output. Current research focuses on developing and evaluating various attribution methods, including gradient-based approaches like Integrated Gradients and game-theoretic methods like SHAP, often applied to deep neural networks (including transformers) and other architectures like Siamese encoders. These efforts address challenges such as faithfulness (accuracy of attributions), robustness (consistency under perturbations), and computational efficiency, ultimately seeking to improve model transparency and trustworthiness for applications ranging from medical diagnosis to scientific discovery.
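Because the summary names Integrated Gradients, a minimal sketch may help show what a gradient-based attribution method actually computes. The snippet below is an illustrative PyTorch implementation, not code from any of the listed papers; the zero baseline, the straight-line interpolation path, and the simple Riemann-sum approximation of the path integral are assumptions made for brevity.

import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    # Attribution_i = (x_i - baseline_i) * average gradient of the model output
    # with respect to feature i along the straight path from baseline to x.
    if baseline is None:
        baseline = torch.zeros_like(x)  # zero baseline: a common, but not universal, choice
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    # Interpolated inputs along the path, shape (steps, *x.shape).
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    # Summing the scalar outputs yields one gradient per interpolation point.
    output = model(path).sum()
    grads = torch.autograd.grad(output, path)[0]
    avg_grads = grads.mean(dim=0)  # Riemann-sum approximation of the path integral
    return (x - baseline) * avg_grads

# Toy check: for a linear model f(z) = w . z, attributions should recover w * x.
w = torch.tensor([1.0, -2.0, 3.0])
model = lambda z: (z * w).sum(dim=-1)
x = torch.tensor([0.5, 1.0, -1.0])
print(integrated_gradients(model, x))  # approximately [0.5, -2.0, -3.0]

A useful property of this construction is completeness: the attributions sum to f(x) - f(baseline), which is one of the quantities faithfulness evaluations often check. Game-theoretic methods such as SHAP instead average a feature's marginal contribution over coalitions of features, trading higher computational cost for axiomatic guarantees from Shapley values.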
Papers
A Robust Unsupervised Ensemble of Feature-Based Explanations using Restricted Boltzmann Machines
Vadim Borisov, Johannes Meier, Johan van den Heuvel, Hamed Jalali, Gjergji Kasneci
"Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification
Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, Katja Filippova