Interpretability Methods
Interpretability methods aim to make the decision-making processes of complex machine learning models, particularly deep learning models like transformers and convolutional neural networks, more transparent and understandable. Current research focuses on developing and evaluating techniques that explain model predictions, including methods based on attention mechanisms, counterfactual generation, and the analysis of internal model representations (e.g., neuron activations, embeddings). These efforts are crucial for building trust in AI systems, improving model debugging and refinement, and enabling responsible deployment in high-stakes applications such as healthcare and finance. A significant challenge lies in developing robust and reliable methods that generalize across different model architectures and datasets, and in establishing objective evaluation criteria for interpretability.
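As a concrete illustration of the kind of technique this covers, the sketch below computes a simple gradient-based saliency attribution for a toy PyTorch classifier: the gradient of the predicted class score with respect to the input indicates which input features most influence the prediction. The model, layer sizes, and input here are hypothetical placeholders for illustration, not an implementation from any particular paper.

```python
# Minimal sketch of gradient-based input attribution ("saliency"), one common
# family of interpretability methods. All names and sizes are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy classifier standing in for a trained deep model.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
)
model.eval()

x = torch.randn(1, 16, requires_grad=True)  # one input example

# Forward pass, then back-propagate the predicted class score to the input
# to measure how sensitive the prediction is to each input feature.
logits = model(x)
pred_class = logits.argmax(dim=1).item()
logits[0, pred_class].backward()

saliency = x.grad.abs().squeeze(0)  # per-feature attribution magnitudes
top_features = saliency.topk(5).indices.tolist()
print("Most influential input features:", top_features)
```

Attention-based and counterfactual methods mentioned above probe the model differently (by inspecting attention matrices or by perturbing inputs until the prediction changes), but they serve the same goal of attributing a prediction to identifiable parts of the input or of the model's internal representations.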