Sparse Autoencoders
Sparse autoencoders (SAEs) are unsupervised machine learning models designed to extract interpretable features from neural network activations by learning sparse, overcomplete representations: each activation is decomposed into a small number of active features drawn from a dictionary that is typically much larger than the activation dimension. Current research focuses on applying SAEs to understand the inner workings of large language models and other deep architectures, with variants such as JumpReLU and Gated SAEs aiming to improve the trade-off between reconstruction fidelity and sparsity. This work advances mechanistic interpretability, enabling a better understanding of model behavior and potentially leading to improved model control and more reliable applications in fields like healthcare and scientific discovery.
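To make the idea concrete, the following is a minimal NumPy sketch of a standard ReLU SAE, not any specific paper's implementation. The dimensions, initialization, and L1 coefficient are illustrative assumptions: activations of size `d_model` are encoded into a larger dictionary of `d_sae` features, and the loss combines reconstruction error with an L1 penalty that encourages sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): the feature dictionary is overcomplete,
# i.e. d_sae is several times larger than the activation dimension d_model.
d_model, d_sae = 16, 64

# Encoder/decoder weights and biases, randomly initialized for this sketch.
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU encoder: produces sparse, non-negative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    # Linear decoder: reconstructs the activation as a sum of feature directions.
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty pushing features toward sparsity.
    f = encode(x)
    x_hat = decode(f)
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.abs(f).sum(axis=-1).mean()
    return mse + l1_coeff * l1

# A batch of fake "model activations" standing in for real LLM activations.
x = rng.normal(size=(8, d_model))
features = encode(x)
print(features.shape)   # batch of 8 vectors in the 64-dimensional feature space
print(sae_loss(x))
```

Variants like JumpReLU and Gated SAEs replace the plain ReLU encoder with activation functions or gating mechanisms that suppress small, noisy feature activations, which is one route to better fidelity at a given sparsity level.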
Papers
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David Krueger, Fazl Barez
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures
Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, Xipeng Qiu