Interpretability Research
Interpretability research aims to understand how and why machine learning models, particularly deep neural networks like transformers, make predictions. Current efforts focus on identifying crucial subnetworks (circuits) within models, developing improved methods for causal analysis (e.g., optimal ablation and counterfactual methods), and evaluating the effectiveness of various explanation techniques (e.g., through human studies and automated metrics). This research is vital for building trust in AI systems, improving model design, and facilitating the responsible deployment of AI in high-stakes applications.
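As a concrete illustration of the causal-analysis methods mentioned above, the sketch below zero-ablates one MLP block in GPT-2 and compares the language-modeling loss before and after the intervention. It is a minimal example, assuming the Hugging Face `transformers` library and plain PyTorch forward hooks; the choice of model, layer, and prompt is illustrative, and zero ablation is used here only as a simpler stand-in for the optimal-ablation and counterfactual methods studied in the papers.

```python
# Minimal zero-ablation sketch (assumptions: GPT-2 via Hugging Face `transformers`,
# the ablated layer, and the prompt are all illustrative choices).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

def lm_loss() -> float:
    # Next-token prediction loss on the prompt itself.
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

baseline = lm_loss()

def zero_output(module, module_inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so this zeroes out the MLP's contribution to the residual stream.
    return torch.zeros_like(output)

layer_to_ablate = 5  # illustrative block index
handle = model.transformer.h[layer_to_ablate].mlp.register_forward_hook(zero_output)
ablated = lm_loss()
handle.remove()

# A large increase in loss suggests the ablated component matters causally
# for this input; little change suggests it is redundant here.
print(f"baseline loss: {baseline:.3f}, ablated loss: {ablated:.3f}")
```

In practice, mean ablation (replacing an activation with its dataset mean) or patching in activations from a counterfactual prompt usually gives a cleaner causal signal than zeroing, since all-zero activations can push the model far off its training distribution.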
Papers
October 17, 2023
October 16, 2023
October 2, 2023
September 29, 2023
June 21, 2023
May 31, 2023
April 28, 2023
December 18, 2022
July 27, 2022
March 27, 2022
February 9, 2022
December 6, 2021