Inherent Interpretability
Inherent interpretability in machine learning focuses on designing models and methods that are transparent and understandable by construction, reducing the "black box" character of many AI systems. Current research emphasizes intrinsically interpretable architectures, such as those based on decision trees, rule-based systems, and specific neural network designs (e.g., Kolmogorov-Arnold Networks), alongside feature-attribution and visualization techniques that make model behavior easier to inspect. This line of work is crucial for building trust in AI, particularly in high-stakes applications such as healthcare and finance, where understanding model decisions is a prerequisite for responsible deployment and effective human-AI collaboration.
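As a concrete illustration of the first category, a model that is interpretable by construction rather than explained after the fact, the sketch below fits a shallow decision tree with scikit-learn and prints its decision rules and global feature importances. The dataset, depth, and other choices are illustrative assumptions, not drawn from any of the papers listed below.

```python
# Minimal sketch of an inherently interpretable model: the full decision
# process can be read directly as rules, in contrast to post-hoc
# explanations of a black-box model. Dataset and hyperparameters are
# arbitrary choices for illustration.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the rule set small enough to inspect by hand.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# The learned decision logic, printed as human-readable if/else rules.
print(export_text(clf, feature_names=list(iris.feature_names)))

# Global feature importances summarize which inputs drive predictions.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```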
Papers
On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity
Vincent Szolnoky, Viktor Andersson, Balazs Kulcsar, Rebecka Jörnsten
Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI
Suzanna Sia, Anton Belyy, Amjad Almahairi, Madian Khabsa, Luke Zettlemoyer, Lambert Mathias
Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer
Javier Ferrando, Gerard I. Gállego, Belen Alastruey, Carlos Escolano, Marta R. Costa-jussà
A Fine-grained Interpretability Evaluation Benchmark for Neural NLP
Lijie Wang, Yaozong Shen, Shuyuan Peng, Shuai Zhang, Xinyan Xiao, Hao Liu, Hongxuan Tang, Ying Chen, Hua Wu, Haifeng Wang
B-cos Networks: Alignment is All We Need for Interpretability
Moritz Böhle, Mario Fritz, Bernt Schiele
Constructive Interpretability with CoLabel: Corroborative Integration, Complementary Features, and Collaborative Learning
Abhijit Suprem, Sanjyot Vaidya, Suma Cherkadi, Purva Singh, Joao Eduardo Ferreira, Calton Pu