Explanation Hacking

Explanation hacking refers to the vulnerability of explainable AI (XAI) methods to manipulation: small, often imperceptible input changes can drastically alter an explanation without changing the model's prediction. Current research identifies these vulnerabilities across model architectures, including deep neural networks (DNNs) and transformers, using adversarial attacks to demonstrate the fragility of explanations in both image and text classification. Because fragile explanations undermine the trustworthiness and ethical deployment of AI systems in high-stakes applications, this work underscores the need for robust and reliable XAI methods.
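
To make the attack setting concrete, the sketch below shows a generic gradient-based manipulation of an input-gradient saliency map: the perturbation is optimized to push the explanation away from the original while penalizing any drift in the model's logits. This is a minimal illustration in PyTorch, not the method of any particular paper; the model, step counts, loss weights, and perturbation budget are placeholder assumptions.

```python
import torch
import torch.nn.functional as F


def saliency(model, x):
    """Input-gradient saliency map: |d score_c / d x| for the top predicted class.

    `x` must be part of the autograd graph (a leaf with requires_grad,
    or a tensor computed from one).
    """
    logits = model(x)
    c = logits.argmax(dim=1, keepdim=True)
    score = logits.gather(1, c).sum()
    # create_graph=True so the saliency itself stays differentiable,
    # which the attack below needs for its second-order gradients.
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs()


def explanation_attack(model, x, steps=200, lr=1e-2, eps=8 / 255, pred_weight=10.0):
    """Find a small perturbation that changes the saliency map while
    keeping the model's outputs (nearly) unchanged. Hyperparameters are
    illustrative, not tuned values from the literature."""
    x = x.detach()
    x_ref = x.clone().requires_grad_(True)
    target_sal = saliency(model, x_ref).detach()   # explanation to move away from
    orig_logits = model(x).detach()                # outputs to preserve
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = x + delta
        sal = saliency(model, x_adv)
        expl_loss = -F.mse_loss(sal, target_sal)             # push explanation away
        pred_loss = F.mse_loss(model(x_adv), orig_logits)    # keep predictions fixed
        loss = expl_loss + pred_weight * pred_loss
        opt.zero_grad()
        loss.backward()   # needs second derivatives of the model w.r.t. its input;
                          # for ReLU networks these vanish almost everywhere, so
                          # papers in this area often substitute smooth activations.
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)   # keep the perturbation in a small L-inf ball
    return (x + delta).detach()
```

Evaluations in this line of work typically confirm that the perturbed input keeps (almost) the same prediction while similarity between the original and manipulated explanations, measured for example by rank correlation or top-k feature overlap, drops sharply.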

Papers