Inherent Interpretability
Inherent interpretability in machine learning focuses on designing models and methods that are transparent and understandable by construction, reducing the "black box" nature of many AI systems. Current research emphasizes intrinsically interpretable architectures, such as decision trees, rule-based systems, and specific neural network designs (e.g., Kolmogorov-Arnold Networks), alongside feature-attribution and visualization techniques that enhance understanding of model behavior. This pursuit is crucial for building trust in AI, particularly in high-stakes domains such as healthcare and finance, where understanding model decisions is essential for responsible deployment and effective human-AI collaboration.
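As a concrete illustration of the idea, the minimal sketch below fits a shallow decision tree whose complete decision logic can be printed and read directly, in contrast to post-hoc explanations of a black-box model. The dataset, tree depth, and scikit-learn calls are illustrative assumptions and are not drawn from the papers listed below.

```python
# Minimal sketch of an inherently interpretable model: a shallow decision
# tree whose fitted structure is itself the explanation of its predictions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small tabular dataset used purely as a stand-in for a real application domain.
data = load_iris()
X, y = data.data, data.target

# Constraining the depth keeps the full decision logic human-readable.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# Print the learned rules as nested if/else conditions over named features.
print(export_text(model, feature_names=list(data.feature_names)))
```

The design choice here is that interpretability comes from the model class itself (a bounded-depth tree), rather than from an attribution method applied after training.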
Papers
Prototype-Based Interpretability for Legal Citation Prediction
Chu Fei Luo, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu
Optimization and Interpretability of Graph Attention Networks for Small Sparse Graph Structures in Automotive Applications
Marion Neumeier, Andreas Tollkühn, Sebastian Dorn, Michael Botsch, Wolfgang Utschick
Language Models Implement Simple Word2Vec-style Vector Arithmetic
Jack Merullo, Carsten Eickhoff, Ellie Pavlick
Concept-Centric Transformers: Enhancing Model Interpretability through Object-Centric Concept Learning within a Shared Global Workspace
Jinyung Hong, Keun Hee Park, Theodore P. Pavlic