Inherent Interpretability
Inherent interpretability in machine learning focuses on designing models and methods that are transparent and understandable by construction, reducing the "black box" nature of many AI systems. Current research emphasizes intrinsically interpretable model architectures, such as those based on decision trees, rule-based systems, and specific neural network designs (e.g., Kolmogorov-Arnold Networks), alongside feature attribution and visualization techniques that aid understanding of model behavior. This pursuit is crucial for building trust in AI, particularly in high-stakes applications like healthcare and finance, where understanding model decisions is paramount for responsible deployment and effective human-AI collaboration.
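As a minimal illustration of the idea (not drawn from any of the listed papers), the sketch below trains a shallow decision tree with scikit-learn and prints its learned decision rules directly, so no post-hoc explanation method is needed; the dataset and depth limit are arbitrary choices for demonstration.

```python
# Minimal sketch of an intrinsically interpretable model: a shallow decision
# tree whose learned rules can be read directly. Dataset (Iris) and the
# max_depth setting are illustrative assumptions, not from the listed papers.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# A small depth keeps the rule set human-readable, which is the core idea
# behind inherently interpretable architectures such as trees and rule lists.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# export_text renders the fitted tree as nested if/else rules over the input
# features, exposing the full decision logic of the model.
print(export_text(model, feature_names=list(data.feature_names)))
```

The transparency here comes from the model class itself: every prediction can be traced to a short chain of threshold comparisons, in contrast to post-hoc attribution applied to an opaque network.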
Papers
CAT: Interpretable Concept-based Taylor Additive Models
Viet Duong, Qiong Wu, Zhengyi Zhou, Hongjue Zhao, Chenxiang Luo, Eric Zavesky, Huaxiu Yao, Huajie Shao
Towards Compositional Interpretability for XAI
Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke
Large Language Models are Interpretable Learners
Ruochen Wang, Si Si, Felix Yu, Dorothea Wiesmann, Cho-Jui Hsieh, Inderjit Dhillon
Self-supervised Interpretable Concept-based Models for Text Classification
Francesco De Santis, Philippe Bich, Gabriele Ciravegna, Pietro Barbiero, Danilo Giordano, Tania Cerquitelli
REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability
Shuang Ao, Simon Khan, Haris Aziz, Flora D. Salim