Mechanistic Interpretability
Mechanistic interpretability aims to understand how neural networks, particularly large language models (LLMs) and image models, perform computations by reverse-engineering their internal mechanisms. Current research focuses on identifying and characterizing "circuits"—minimal subnetworks responsible for specific tasks—within transformer architectures, often using techniques like sparse autoencoders and activation patching to analyze neuron activations and attention mechanisms. This work is crucial for improving model reliability, safety, and trustworthiness, as well as for gaining fundamental insights into the nature of artificial intelligence and its relationship to human cognition.
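To make the sparse-autoencoder technique mentioned above concrete, below is a minimal sketch of the kind of dictionary-learning setup used to decompose model activations into sparser, more interpretable features. It assumes PyTorch; the dimensions, hyperparameters, and the synthetic stand-in for cached activations are illustrative placeholders rather than values from any specific paper.

```python
# Minimal sparse autoencoder sketch for decomposing cached model activations.
# All sizes and the random "activations" tensor are illustrative assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: d_hidden is typically several times d_model.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative and encourages sparsity.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes features toward sparsity.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity


if __name__ == "__main__":
    d_model, d_hidden = 64, 512  # hypothetical residual-stream width and dictionary size
    sae = SparseAutoencoder(d_model, d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

    # Stand-in for activations cached from a transformer layer; in practice these
    # would be collected with forward hooks on the model being studied.
    activations = torch.randn(1024, d_model)

    for step in range(100):
        recon, feats = sae(activations)
        loss = sae_loss(activations, recon, feats)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Once trained, individual hidden units (dictionary features) can be inspected by finding the inputs that activate them most strongly, which is one common route from raw activations toward candidate circuit components.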
Papers
Dated entries from September 23, 2024 through December 23, 2024.