Explanation Hacking
Explanation hacking refers to the vulnerability of explainable AI (XAI) methods to manipulation: small, carefully chosen input changes can drastically alter the explanation while leaving the model's prediction unchanged. Current research focuses on identifying these vulnerabilities across model architectures, including deep neural networks (DNNs) and transformers, using adversarial attacks to demonstrate the fragility of explanations in both image and text classification. These findings underscore the need for robust and reliable XAI methods, since fragile explanations undermine the trustworthy and ethical deployment of AI systems in high-stakes applications.
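To make the attack pattern concrete, below is a minimal sketch of one common form of explanation hacking against a gradient (saliency-map) explanation. It assumes a differentiable PyTorch classifier; the function names and hyperparameters are illustrative and not drawn from any specific paper. The perturbation is optimized to push the saliency map toward an attacker-chosen target explanation while a second loss term keeps the predicted class fixed.

```python
# Minimal sketch of an explanation-manipulation ("explanation hacking") attack
# on a vanilla-gradient saliency explanation. All names are illustrative.
import torch
import torch.nn.functional as F


def saliency(model, x):
    """Gradient of the top-class logit w.r.t. the input (a simple explanation).
    `x` must require grad or be part of a graph that does."""
    logits = model(x)
    top = logits.max(dim=1).values.sum()
    # create_graph=True lets us later differentiate *through* the explanation.
    (grad,) = torch.autograd.grad(top, x, create_graph=True)
    return grad.abs()


def hack_explanation(model, x, target_expl, steps=200, lr=1e-3, eps=0.03):
    """Find a small perturbation of x whose saliency map resembles `target_expl`
    while the model's predicted class stays the same."""
    with torch.no_grad():
        orig_class = model(x).argmax(dim=1)

    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        x_adv = (x + delta).clamp(0, 1)
        expl = saliency(model, x_adv)
        # Term 1: steer the explanation toward the attacker's target map.
        expl_loss = F.mse_loss(expl, target_expl)
        # Term 2: keep the original prediction, so the output is unaffected.
        pred_loss = F.cross_entropy(model(x_adv), orig_class)
        loss = expl_loss + pred_loss

        opt.zero_grad()
        loss.backward()
        opt.step()

        # Keep the perturbation imperceptibly small (L-infinity bound).
        with torch.no_grad():
            delta.clamp_(-eps, eps)

    return (x + delta).detach().clamp(0, 1)
```

The key design point, which mirrors the attacks surveyed above, is the two-part objective: one term manipulates the explanation, the other constrains the prediction, so the resulting input looks benign to anyone who only inspects the model's output. Analogous attacks exist for other explanation families (e.g., attention or perturbation-based methods), with the explanation loss swapped accordingly.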