Interpretability Tool

Interpretability tools aim to make the inner workings of complex machine learning models, particularly deep neural networks and large language models, more transparent and understandable. Current research focuses on developing methods to explain model decisions for various architectures, including convolutional neural networks (CNNs) and transformers, often employing techniques like feature attribution and dialogue-based explanations. This work is crucial for building trust in AI systems, improving model debugging and design, and facilitating responsible deployment across diverse applications, from healthcare to finance. The ultimate goal is to move beyond simply identifying model outputs to understanding the reasoning behind them.
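Since the summary above names feature attribution as a representative technique, here is a minimal sketch of one of its simplest variants, gradient-times-input saliency, written in PyTorch. The tiny CNN and random input are placeholders chosen purely for illustration, not a model or dataset referenced by any of the papers below; in practice you would apply the same gradient computation to a trained classifier and a real sample.

```python
# Minimal sketch of gradient x input feature attribution (assumes PyTorch is installed).
# The model and input are stand-ins; any differentiable classifier works the same way.
import torch
import torch.nn as nn

# Hypothetical small CNN standing in for an image classifier.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model.eval()

# Dummy input image (batch of 1, 3x32x32); gradients are tracked w.r.t. the pixels.
x = torch.randn(1, 3, 32, 32, requires_grad=True)

# Forward pass, then take the score of the predicted class.
logits = model(x)
target = logits.argmax(dim=1).item()
score = logits[0, target]

# Backward pass: d(score)/d(pixels) gives a per-pixel sensitivity map.
score.backward()

# Gradient x input attribution, summed over channels for a single heatmap.
attribution = (x.grad * x.detach()).sum(dim=1).squeeze(0)
print(attribution.shape)  # torch.Size([32, 32]) -- one importance value per pixel
```

Multiplying the gradient by the input (rather than using the raw gradient alone) weights sensitivity by the actual pixel values, which is one common convention among attribution methods; libraries such as Captum implement this and many related techniques.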

Papers