Mechanistic Interpretability
Mechanistic interpretability aims to understand how neural networks, particularly large language models (LLMs) and image models, perform their computations by reverse-engineering the internal mechanisms that produce them. Current research focuses on identifying and characterizing "circuits"—minimal subnetworks responsible for specific tasks—within transformer architectures, often using techniques such as sparse autoencoders and activation patching to analyze neuron activations and attention mechanisms. This work is crucial for improving model reliability, safety, and trustworthiness, as well as for gaining fundamental insight into the nature of artificial intelligence and its relationship to human cognition.
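To make the activation-patching idea concrete, below is a minimal sketch in PyTorch. It uses a toy MLP as a stand-in for a trained model (an assumption; real work typically patches transformer components such as attention heads or MLP outputs): the activation at one layer is cached on a "clean" input and then swapped into a run on a "corrupted" input, and the effect on the output indicates how much that site contributes to the behavior under study.

```python
# Minimal activation-patching sketch (assumption: PyTorch, toy untrained MLP).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a trained network; model[2] is the site we will patch.
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 4),
)

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1. Cache the chosen layer's activation on the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["clean_act"] = output.detach()

handle = model[2].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, overwriting that layer's output
#    with the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["clean_act"]

handle = model[2].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

corrupted_out = model(corrupted_input)

# 3. If patching this site moves the output toward the clean run,
#    the site carries task-relevant information.
print("clean vs corrupted:", (clean_out - corrupted_out).norm().item())
print("clean vs patched:  ", (clean_out - patched_out).norm().item())
```

In practice the same pattern is applied per attention head or per layer across many prompt pairs, and the resulting causal effects are used to localize the circuit responsible for a task.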