Interpretability Research
Interpretability research aims to understand how and why machine learning models, particularly deep neural networks like transformers, make predictions. Current efforts focus on identifying crucial subnetworks (circuits) within models, developing improved methods for causal analysis (e.g., optimal ablation and counterfactual methods), and evaluating the effectiveness of various explanation techniques (e.g., through human studies and automated metrics). This research is vital for building trust in AI systems, improving model design, and facilitating the responsible deployment of AI in high-stakes applications.
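To make the causal-analysis idea concrete, below is a minimal sketch of activation ablation, one of the simplest causal interventions used in circuit analysis. It uses a toy PyTorch model; the module names, shapes, and the mean-ablation baseline are illustrative assumptions rather than the method of any particular paper above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for a transformer sub-component.
model = nn.Sequential(
    nn.Linear(16, 32),  # "upstream" component whose output we ablate
    nn.ReLU(),
    nn.Linear(32, 4),   # readout producing logits
)
model.eval()

x = torch.randn(8, 16)              # a small batch of inputs
with torch.no_grad():
    clean_logits = model(x)         # unablated (clean) run

# First pass: record the component's activations to build a mean baseline.
acts = {}
def save_hook(module, inputs, output):
    acts["h"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
with torch.no_grad():
    model(x)
handle.remove()
mean_act = acts["h"].mean(dim=0, keepdim=True)

# Ablation hook: replace the component's output with the mean activation
# (a common alternative to zeroing it out).
def ablate_hook(module, inputs, output):
    return mean_act.expand_as(output)

handle = model[0].register_forward_hook(ablate_hook)
with torch.no_grad():
    ablated_logits = model(x)
handle.remove()

# The component's causal contribution is estimated by how much the output
# shifts when the component is ablated.
effect = (clean_logits - ablated_logits).norm(dim=-1).mean()
print(f"mean change in logits under mean-ablation: {effect:.4f}")
```

Components whose ablation barely moves the output are candidates for pruning from the hypothesized circuit; those that change it substantially are kept, which is the basic logic behind circuit-identification and optimal-ablation methods.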
Papers
Nineteen papers on this topic, published between December 11, 2023 and November 11, 2024.