Sparse Autoencoders
Sparse autoencoders (SAEs) are unsupervised machine learning models designed to extract interpretable features from neural network activations by learning sparse, overcomplete representations: each activation is decomposed into a small number of active features drawn from a dictionary that is typically much larger than the activation dimension. Current research focuses on applying SAEs to understand the inner workings of large language models and other deep architectures, with variants such as JumpReLU and Gated SAEs aiming to improve the trade-off between reconstruction fidelity and sparsity. This work advances mechanistic interpretability, enabling a better understanding of model behavior and potentially leading to improved model control and more reliable applications in fields like healthcare and scientific discovery.
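To make the idea concrete, the following is a minimal NumPy sketch of a standard ReLU SAE, not any specific paper's implementation. The dimensions, initialization, and L1 coefficient are illustrative assumptions: activations of size `d_model` are encoded into a larger dictionary of `d_sae` features, and the loss combines reconstruction error with an L1 penalty that encourages sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): the feature dictionary is overcomplete,
# i.e. d_sae is several times larger than the activation dimension d_model.
d_model, d_sae = 16, 64

# Encoder/decoder weights and biases, randomly initialized for this sketch.
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: produces sparse, non-negative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    # Linear decoder: reconstructs the activation as a sum of feature directions.
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty pushing features toward sparsity.
    f = encode(x)
    x_hat = decode(f)
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.abs(f).sum(axis=-1).mean()
    return mse + l1_coeff * l1

# A batch of fake "model activations" standing in for real LLM activations.
x = rng.normal(size=(8, d_model))
features = encode(x)
print(features.shape)   # batch of 8 vectors in the 64-dimensional feature space
print(sae_loss(x))
```

Variants like JumpReLU and Gated SAEs replace the plain ReLU encoder with activation functions or gating mechanisms that suppress small, noisy feature activations, which is one route to better fidelity at a given sparsity level.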
Papers
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures
Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, Xipeng Qiu