Sparse Autoencoders
Sparse autoencoders (SAEs) are unsupervised models designed to extract interpretable features from neural network activations by learning sparse, typically overcomplete representations in which only a few features are active for any given input. Current research focuses on applying SAEs to understand the inner workings of large language models and other deep learning architectures, with variants such as JumpReLU and Gated SAEs aimed at improving the trade-off between reconstruction fidelity and sparsity. This work advances mechanistic interpretability, enabling a better understanding of model behavior and potentially leading to improved model control and more reliable applications in fields such as healthcare and scientific discovery.
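To make the basic setup concrete, below is a minimal sketch of a standard ReLU sparse autoencoder with an L1 sparsity penalty, written in PyTorch. The dimensions, the `l1_coeff` value, and the class and function names are illustrative assumptions for this sketch, not details taken from the papers listed here (which study variants such as JumpReLU and Gated SAEs and other training schemes).

```python
# Minimal sketch of a sparse autoencoder (SAE) for interpreting model activations.
# Assumes PyTorch; all sizes and the L1 coefficient are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: d_hidden is usually several times d_model.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU encoder yields non-negative feature activations; sparsity is
        # encouraged by an L1 penalty on these activations during training.
        features = F.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction fidelity plus an L1 sparsity penalty on the features.
    mse = F.mse_loss(reconstruction, x)
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


if __name__ == "__main__":
    # Toy usage on random inputs; in practice x would be activations
    # (e.g. residual-stream vectors) collected from a language model.
    sae = SparseAutoencoder(d_model=512, d_hidden=4096)
    x = torch.randn(64, 512)
    recon, feats = sae(x)
    loss = sae_loss(x, recon, feats)
    loss.backward()
    print(loss.item())
```

The variants mentioned above mainly change the encoder nonlinearity or the sparsity mechanism (for example, JumpReLU replaces the plain ReLU with a thresholded activation), while keeping this reconstruct-plus-sparsify objective.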
Papers
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi, Federico Belotti, Marco Molinari
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Viacheslav Surkov, Chris Wendler, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre
The Persian Rug: solving toy models of superposition using large-scale symmetries
Aditya Cowsik, Kfir Dolev, Alex Infanger
Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon, Manish Shrivastava, David Krueger, Ekdeep Singh Lubana
Can sparse autoencoders make sense of latent representations?
Viktoria Schuster
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde, Michael T. Pearce, Lee Sharkey