Sparse Autoencoders
Sparse autoencoders (SAEs) are unsupervised models that extract interpretable features from neural network activations by learning sparse, overcomplete representations: each activation vector is reconstructed from a small number of active features drawn from a dictionary much larger than the activation dimension. Current research focuses on applying SAEs to the internals of large language models and other deep architectures, with variants such as JumpReLU and Gated SAEs improving the trade-off between reconstruction fidelity and sparsity. This work advances mechanistic interpretability, enabling a better understanding of model behavior and potentially leading to improved model control and more reliable applications in fields such as healthcare and scientific discovery.
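To make the basic recipe concrete, here is a minimal sketch of a vanilla SAE trained with an L1 sparsity penalty, assuming activations harvested from some model layer. The names (SparseAutoencoder, sae_loss, d_model, d_features, l1_coeff) and the dictionary-size ratio are illustrative, not any specific paper's implementation; variants like JumpReLU or Gated SAEs replace the plain ReLU encoder with different gating or thresholding nonlinearities.

```python
# Minimal sparse autoencoder sketch (PyTorch), assuming a ReLU encoder
# and an L1 penalty on feature activations. Names and hyperparameters
# are hypothetical, for illustration only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: d_features is typically much larger
        # than the activation dimension d_model.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # Feature activations; the L1 penalty drives most entries to
        # zero during training, so only a few features fire per input.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus a sparsity penalty; l1_coeff trades
    # one off against the other.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage: x stands in for a batch of residual-stream activations.
sae = SparseAutoencoder(d_model=768, d_features=8 * 768)
x = torch.randn(32, 768)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```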