Interpretability Research

Interpretability research aims to understand how and why machine learning models, particularly deep neural networks such as transformers, arrive at their predictions. Current efforts focus on identifying the subnetworks (circuits) within a model that drive specific behaviors, developing better methods for causal analysis (e.g., optimal ablation and counterfactual interventions), and evaluating explanation techniques through human studies and automated metrics; a small sketch of ablation-based analysis follows below. This work is essential for building trust in AI systems, improving model design, and supporting the responsible deployment of AI in high-stakes applications.
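
As a concrete illustration of the ablation-based causal analysis mentioned above, the sketch below mean-ablates one hidden component of a toy PyTorch model and measures how much the outputs change. The toy model, the hook, and the effect metric are illustrative assumptions for exposition, not the method of any particular paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 2-layer MLP standing in for the model component under study.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

x = torch.randn(32, 8)   # a batch of inputs
baseline = model(x)      # unablated outputs

# Mean-ablation: replace the hidden activations with their batch mean,
# removing the input-dependent information carried by that component.
def mean_ablate(module, inputs, output):
    return output.mean(dim=0, keepdim=True).expand_as(output)

handle = model[1].register_forward_hook(mean_ablate)  # hook the ReLU output
ablated = model(x)
handle.remove()

# The size of this gap is one crude measure of how much the ablated
# component contributes to the model's behavior on this batch.
effect = (baseline - ablated).abs().mean()
print(f"mean absolute change in output: {effect:.4f}")
```

The same hook pattern extends to real transformer components (attention heads, MLP blocks), where comparing zero-, mean-, or counterfactual-ablation effects is one common way to locate candidate circuit components.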

Papers