Post Hoc Interpretability

Post-hoc interpretability aims to explain the decisions of already-trained machine learning models, particularly deep neural networks (DNNs), by probing their behavior or internal representations after training is complete. Current research focuses on developing and evaluating methods such as LIME, Grad-CAM, and SHAP, as well as novel approaches tailored to specific model architectures (e.g., graph neural networks, convolutional neural networks) and data types (e.g., images, text, time series). Improving the reliability and robustness of these methods, including addressing issues such as perturbation artifacts and the faithful representation of feature interactions, is crucial for building trust and facilitating the adoption of AI in high-stakes applications such as healthcare and nuclear safety.
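
To make the idea concrete, the sketch below shows a minimal LIME-style local surrogate: it perturbs a single input around its neighborhood, queries an already-trained black-box classifier, and fits a distance-weighted linear model whose coefficients serve as local feature attributions. This is an illustrative sketch rather than the reference LIME implementation; the dataset, the random-forest black box, the `explain_instance` helper, and the kernel width are all placeholder assumptions.

```python
# Minimal LIME-style local surrogate (illustrative sketch, not the official
# LIME library). Assumes NumPy and scikit-learn are available; the dataset,
# black-box model, and kernel width are placeholder choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Train a black-box model; post-hoc methods treat it as already fixed.
X, y = load_breast_cancer(return_X_y=True)
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
feature_scale = X.std(axis=0)  # per-feature scale used for perturbations


def explain_instance(x, predict_proba, scale, n_samples=2000, kernel_width=1.0, seed=0):
    """Fit a distance-weighted linear surrogate around x; return per-feature weights."""
    rng = np.random.default_rng(seed)
    # Sample perturbed neighbors of x and query the black box on them.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    preds = predict_proba(Z)[:, 1]  # black-box probability of the positive class
    # Weight neighbors by proximity to x (an exponential kernel, as in LIME).
    dists = np.linalg.norm((Z - x) / scale, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))
    # The surrogate's coefficients act as local feature attributions.
    surrogate = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
    return surrogate.coef_


attributions = explain_instance(X[0], black_box.predict_proba, feature_scale)
top = np.argsort(np.abs(attributions))[::-1][:5]
print("Top local features:", top, attributions[top])
```

Because the surrogate is only trusted near the chosen instance, the kernel width and perturbation scale strongly influence the resulting explanation, which is one source of the perturbation artifacts and robustness concerns mentioned above.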

Papers