Mechanistic Interpretability
Mechanistic interpretability aims to understand how neural networks, particularly large language models (LLMs) and image models, perform computations by reverse-engineering their internal mechanisms. Current research focuses on identifying and characterizing "circuits"—minimal subnetworks responsible for specific tasks—within transformer architectures, often using techniques like sparse autoencoders and activation patching to analyze neuron activations and attention mechanisms. This work is crucial for improving model reliability, safety, and trustworthiness, as well as for gaining fundamental insights into the nature of artificial intelligence and its relationship to human cognition.
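As a concrete illustration of the activation-patching technique mentioned above, the sketch below caches an MLP activation from a "clean" prompt, splices it into a run on a "corrupted" prompt, and checks how much the logit of the clean answer recovers. It is a minimal sketch, assuming a standard Hugging Face GPT-2 checkpoint and PyTorch forward hooks; the prompts, layer index, and target token are illustrative choices, not drawn from any specific paper listed here.

```python
# Minimal activation-patching sketch (illustrative; module paths follow the
# standard Hugging Face GPT-2 layout, and the prompts/layer are assumptions).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
target = tok(" Paris")["input_ids"][0]   # token whose logit we track

layer = 6                                # which block's MLP output to patch
cache = {}

def save_hook(module, inputs, output):
    # Cache the clean activation at this MLP.
    cache["mlp"] = output.detach()

def patch_hook(module, inputs, output):
    # Overwrite the final-position activation with the cached clean one.
    patched = output.clone()
    patched[:, -1, :] = cache["mlp"][:, -1, :]
    return patched

mlp = model.transformer.h[layer].mlp

# 1) Clean run: record the MLP activation.
h = mlp.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
h.remove()

# 2) Corrupted runs: without and with the clean activation patched in.
with torch.no_grad():
    base_logit = model(**corrupt).logits[0, -1, target].item()
h = mlp.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logit = model(**corrupt).logits[0, -1, target].item()
h.remove()

# A large recovery of the clean answer's logit suggests this layer's MLP output
# at the final position carries task-relevant information.
print(f"logit(' Paris')  corrupted: {base_logit:.3f}  patched: {patched_logit:.3f}")
```

Sweeping this kind of patch over layers and token positions is how circuit analyses typically localize which components matter for a given behavior; sparse autoencoders play a complementary role by decomposing the cached activations into more interpretable features.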
Papers (April 2023 to December 2023)