Mechanistic Interpretability
Mechanistic interpretability aims to understand how neural networks, particularly large language models (LLMs) and image models, perform computations by reverse-engineering their internal mechanisms. Current research focuses on identifying and characterizing "circuits"—minimal subnetworks responsible for specific tasks—within transformer architectures, often using techniques like sparse autoencoders and activation patching to analyze neuron activations and attention mechanisms. This work is crucial for improving model reliability, safety, and trustworthiness, as well as for gaining fundamental insights into the nature of artificial intelligence and its relationship to human cognition.
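As a rough illustration of one technique named above, the sketch below shows a minimal sparse autoencoder in PyTorch that decomposes a batch of activation vectors into an overcomplete set of sparse features. The layer sizes, L1 coefficient, and random stand-in activations are illustrative assumptions, not values taken from any of the listed papers.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: maps a model activation vector to an
    overcomplete feature space, with an L1 penalty encouraging sparsity."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        reconstruction = self.decoder(features)  # reconstruct the input activation
        return reconstruction, features


def loss_fn(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    recon_loss = (reconstruction - x).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=512, d_hidden=4096)
    acts = torch.randn(64, 512)  # stand-in for cached residual-stream activations
    recon, feats = sae(acts)
    print(loss_fn(acts, recon, feats))
```

In practice such an autoencoder is typically trained on activations cached from a specific layer of a real model, after which the learned decoder directions are inspected as candidate interpretable features.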
Papers