Attribution Attack

Attribution attacks target the explainability methods used to understand deep learning models, manipulating the explanations they produce (e.g., saliency or attribution maps) without significantly altering the model's predictions. Current research focuses on estimating neuron importance more accurately within these attacks, crafting attacks that transfer across model architectures, and probing vulnerabilities in the secure aggregation protocols used in federated learning. These attacks highlight the fragility of model interpretability and pose significant security risks in sensitive applications where trust in model explanations is crucial, driving efforts to develop more robust explanation methods and defensive strategies.
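
The core optimization behind many of these attacks can be illustrated concretely. The sketch below is a minimal, hypothetical PyTorch implementation (the names `saliency` and `attribution_attack` and all hyperparameters are illustrative assumptions, not taken from any specific paper): a PGD-style perturbation pushes the input-gradient saliency map away from the clean one, while a KL-divergence penalty keeps the output distribution, and hence the prediction, nearly unchanged.

```python
import torch
import torch.nn.functional as F

def saliency(model, x, target):
    """Plain input-gradient attribution for the target class score."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target]
    grad, = torch.autograd.grad(score, x)
    return grad.detach()

def attribution_attack(model, x, target, eps=8 / 255, alpha=1 / 255,
                       steps=50, lam=10.0):
    """PGD-style attack on the explanation rather than the prediction.

    Minimizes cosine similarity between the perturbed and original
    saliency maps; a KL term (weighted by lam, an assumed trade-off
    hyperparameter) keeps the predicted distribution close to the
    clean one. Caveat: for purely piecewise-linear (ReLU) networks the
    saliency map is piecewise constant, so useful second-order
    gradients typically require a smooth surrogate (e.g. temporarily
    replacing ReLU with softplus), as is common in this line of work.
    """
    model.eval()
    orig_sal = saliency(model, x, target)
    orig_probs = F.softmax(model(x), dim=1).detach()
    delta = torch.zeros_like(x, requires_grad=True)

    for _ in range(steps):
        adv = x + delta
        logits = model(adv)
        score = logits[0, target]
        # create_graph=True lets us differentiate through the saliency itself.
        sal, = torch.autograd.grad(score, adv, create_graph=True)
        sal_term = F.cosine_similarity(sal.flatten(), orig_sal.flatten(), dim=0)
        pred_term = F.kl_div(F.log_softmax(logits, dim=1), orig_probs,
                             reduction="batchmean")
        loss = sal_term + lam * pred_term  # distort explanation, preserve output
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()              # signed gradient step
            delta.clamp_(-eps, eps)                   # stay within L_inf budget
            delta.copy_((x + delta).clamp(0, 1) - x)  # keep the image valid
    return (x + delta).detach()
```

A soft KL penalty is used here rather than a hard constraint on the predicted label; published attacks vary in this choice, and a stricter variant would reject any step that flips the argmax. The same loop structure accommodates other attribution methods (e.g., integrated gradients) by swapping the `saliency` function.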

Papers