Inherent Interpretability
Inherent interpretability in machine learning focuses on designing models and methods that are transparent and understandable by construction, reducing the "black box" character of many AI systems. Current research emphasizes intrinsically interpretable architectures, such as decision trees, rule-based systems, and specialized neural network designs (e.g., Kolmogorov-Arnold Networks), alongside feature attribution and visualization techniques that clarify model behavior. This work is crucial for building trust in AI, particularly in high-stakes applications such as healthcare and finance, where understanding model decisions is essential for responsible deployment and effective human-AI collaboration.
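As a minimal sketch of the "interpretable by construction" idea described above, the example below fits a depth-limited decision tree with scikit-learn and prints its learned rules, so the model itself serves as the explanation. The dataset, tree depth, and use of scikit-learn are illustrative assumptions, not details drawn from any paper listed here.

```python
# Minimal sketch (assumption: scikit-learn is installed; dataset and depth are illustrative):
# a shallow decision tree as an example of an intrinsically interpretable model.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Small tabular dataset standing in for a high-stakes domain such as healthcare.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# Restrict depth so the entire model can be read as a handful of if-then rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The printed rules *are* the model: no post-hoc explanation step is required.
print(export_text(tree, feature_names=list(data.feature_names)))
print("held-out accuracy:", tree.score(X_test, y_test))
```

The design trade-off this illustrates is that transparency is obtained by constraining model capacity (here, the tree depth); the papers below explore ways to retain such transparency in more expressive models.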
Papers
CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models
Xintong Wang, Jingheng Pan, Longqin Jiang, Liang Ding, Xingshan Li, Chris Biemann
ProtoLens: Advancing Prototype Learning for Fine-Grained Interpretability in Text Classification
Bowen Wei, Ziwei Zhu
SignAttention: On the Interpretability of Transformer Models for Sign Language Translation
Pedro Alejandro Dal Bianco, Oscar Agustín Stanchi, Facundo Manuel Quiroga, Franco Ronchetti, Enzo Ferrante
Interpreting Microbiome Relative Abundance Data Using Symbolic Regression
Swagatam Haldar, Christoph Stein-Thoeringer, Vadim Borisov
Reproducibility study of "LICO: Explainable Models with Language-Image Consistency"
Luan Fletcher, Robert van der Klis, Martin Sedláček, Stefan Vasilev, Christos Athanasiadis
Automatically Interpreting Millions of Features in Large Language Models
Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose
Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers
Patrik Zavoral, Dušan Variš, Ondřej Bojar
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
FragNet: A Graph Neural Network for Molecular Property Prediction with Four Layers of Interpretability
Gihan Panapitiya, Peiyuan Gao, C Mark Maupin, Emily G Saldanha
Can sparse autoencoders make sense of latent representations?
Viktoria Schuster
On Championing Foundation Models: From Explainability to Interpretability
Shi Fu, Yuzhu Chen, Yingjie Wang, Dacheng Tao
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde, Michael T. Pearce, Lee Sharkey