Paper ID: 2403.09863

Towards White Box Deep Learning

Maciej Satkiewicz

Deep neural networks learn fragile "shortcut" features, rendering them difficult to interpret (black box) and vulnerable to adversarial attacks. This paper proposes semantic features as a general architectural solution to this problem. The main idea is to make features locality-sensitive in the adequate semantic topology of the domain, thus introducing a strong regularization. The proof-of-concept network is lightweight, inherently interpretable and achieves almost human-level adversarial test metrics, with no adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at https://github.com/314-Foundation/white-box-nn
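To make "locality-sensitive in a semantic topology" a little more concrete, here is a toy sketch that is our own illustration, not the construction from the paper (see the linked repository for that). It assumes the "semantic neighbourhood" of a feature is simply a set of small translations and that matching is cosine similarity; the names `semantic_feature`, `pixel_feature` and `match_score` are hypothetical. A feature that takes its best match over that neighbourhood keeps responding when the input pattern shifts by one pixel, whereas a feature tied to a single fixed position does not.

```python
import numpy as np


def match_score(patch: np.ndarray, template: np.ndarray) -> float:
    # Cosine similarity between an image patch and the template
    # (defined as 0.0 when either has zero norm).
    denom = np.linalg.norm(patch) * np.linalg.norm(template)
    return float(np.sum(patch * template) / denom) if denom > 0 else 0.0


def pixel_feature(image: np.ndarray, template: np.ndarray, pos: tuple) -> float:
    # Baseline (hypothetical): match the template at one fixed position only.
    th, tw = template.shape
    y, x = pos
    return match_score(image[y:y + th, x:x + tw], template)


def semantic_feature(image: np.ndarray, template: np.ndarray,
                     pos: tuple, radius: int = 1) -> float:
    # Hypothetical "locality-sensitive" feature: the best match over a
    # neighbourhood of positions, i.e. over small translations of the
    # template. ASSUMPTION: translations stand in for the semantic topology.
    th, tw = template.shape
    cy, cx = pos
    best = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            # Skip offsets whose window would fall outside the image.
            if 0 <= y <= image.shape[0] - th and 0 <= x <= image.shape[1] - tw:
                best = max(best, match_score(image[y:y + th, x:x + tw], template))
    return best


# A 3x3 vertical-bar template and an 8x8 image containing that bar.
template = np.array([[0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0]], dtype=float)
image = np.zeros((8, 8))
image[2:5, 3] = 1.0                      # bar aligned with the window at (2, 2)
shifted = np.roll(image, 1, axis=1)      # same bar, moved one pixel right

print(pixel_feature(image, template, (2, 2)),
      pixel_feature(shifted, template, (2, 2)))
print(semantic_feature(image, template, (2, 2)),
      semantic_feature(shifted, template, (2, 2)))
```

The fixed-position feature drops to zero after the one-pixel shift, while the neighbourhood-max feature still reports a (near) perfect match. This only illustrates why restricting features to a meaningful neighbourhood can act as a regularizer; it makes no claim about the paper's adversarial results.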

Submitted: Mar 14, 2024

Topics

Deep Neural Network
Adversarial Attack
Adversarial Training
Semantic Feature
Adversarial Testing