Concept Attribution

Concept attribution aims to explain how machine learning models arrive at their predictions by identifying the high-level concepts or features that drive those decisions. Current research focuses on developing robust and reliable attribution methods, employing techniques such as integrated gradients and non-negative matrix factorization across architectures including transformers and convolutional neural networks. This work is central to making AI systems more transparent and trustworthy, particularly in high-stakes applications where understanding model behavior is paramount, and to addressing issues such as bias and adversarial attacks. Improved concept attribution methods promise more effective model debugging and fairness analysis, and ultimately more reliable and explainable AI.
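As a concrete illustration of one technique named above, the sketch below implements integrated gradients in PyTorch for a toy classifier. The model, input, baseline, and target class are illustrative placeholders, not drawn from any specific paper; the same machinery is commonly applied to intermediate activations or concept directions rather than raw input features.

```python
# Minimal sketch of integrated gradients (one attribution technique mentioned above).
# The toy model, input, and zero baseline are assumptions for illustration only.
import torch
import torch.nn as nn


def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Approximate integrated gradients of the target-class logit w.r.t. x."""
    # Interpolate between the baseline and the input along a straight-line path.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)          # shape: (steps, *x.shape)
    path.requires_grad_(True)

    logits = model(path)                               # shape: (steps, num_classes)
    logits[:, target_class].sum().backward()

    avg_grads = path.grad.mean(dim=0)                  # average gradient along the path
    return (x - baseline) * avg_grads                  # scaled to satisfy completeness


if __name__ == "__main__":
    # Hypothetical toy classifier: 4 input features, 3 output classes.
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
    x = torch.randn(4)
    baseline = torch.zeros(4)                          # common (but not mandatory) choice
    attributions = integrated_gradients(model, x, baseline, target_class=1)
    print(attributions)
```

The straight-line path and zero baseline follow the standard integrated-gradients formulation; the attributions sum approximately to the difference between the logit at the input and at the baseline, which is what makes the method useful for checking whether a concept or feature genuinely accounts for a prediction.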

Papers