Explanation Regularization

Explanation Regularization (ER) is a machine learning technique that aims to improve model performance and interpretability by aligning a model's predictions, or the attributions behind them, with human-understandable explanations, often called rationales. Current research focuses on developing ER methods for tasks such as text classification and analog neural networks, exploring different ways to incorporate rationales into training, for example via differentiable rationale extractors or consensus-based regularization. The impact of ER is being evaluated in both in-distribution and out-of-distribution settings, with growing emphasis on its effect on model robustness and generalization, ultimately contributing to more reliable and trustworthy AI systems.
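
To make the idea concrete, the sketch below shows one common way rationales enter the training objective: a task loss plus a penalty that discourages input-gradient attributions on features a human annotator marked as irrelevant (the "right for the right reasons" style of ER). This is a minimal illustration, assuming a differentiable model over continuous inputs; the names `er_loss`, `rationale_mask`, and `lam` are hypothetical, and specific ER methods differ in how they extract and compare explanations.

```python
import torch
import torch.nn.functional as F

def er_loss(model, x, y, rationale_mask, lam=1.0):
    """Task loss plus an explanation penalty.

    rationale_mask: same shape as x, with 1 for features a human
    marked as relevant and 0 for features marked irrelevant.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)

    # Saliency: gradient of the summed log-probabilities w.r.t. the input.
    # create_graph=True so the penalty itself can be backpropagated.
    grads = torch.autograd.grad(
        F.log_softmax(logits, dim=-1).sum(), x, create_graph=True
    )[0]

    # Penalize attribution mass on features outside the human rationale.
    penalty = ((1.0 - rationale_mask) * grads).pow(2).sum()

    return task_loss + lam * penalty

# One training step (hypothetical model, optimizer, and batch):
# loss = er_loss(model, inputs, labels, masks, lam=10.0)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The hyperparameter `lam` trades off task accuracy against agreement with the rationales; other ER variants replace the gradient saliency with a learned rationale extractor or a consensus term, but the additive loss structure is the same.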

Papers