Interpretable Algorithm
Interpretable algorithms aim to make the decision-making processes of machine learning models transparent and understandable, enhancing trust and accountability. Current research focuses on developing methods to reverse-engineer existing models (e.g., transformers) to reveal their underlying algorithms, as well as designing inherently interpretable models from the outset, often employing techniques like concept bottlenecks and incorporating prior knowledge through optimization frameworks. This field is crucial for building trust in AI systems across various domains, particularly in high-stakes applications like healthcare, where understanding model decisions is paramount for responsible deployment.
Papers
Can LLMs faithfully generate their layperson-understandable 'self'?: A Case Study in High-Stakes Domains
Arion Das, Asutosh Mishra, Amitesh Patel, Soumilya De, V. Gurucharan, Kripabandhu Ghosh
Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
Jatin Nainani, Sankaran Vaidyanathan, AJ Yeung, Kartik Gupta, David Jensen