Paper ID: 2402.03317

SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization

Xixu Hu, Runkai Zheng, Jindong Wang, Cheuk Hang Leung, Qi Wu, Xing Xie

Vision Transformers (ViTs) have gained prominence as a preferred choice for a wide range of computer vision tasks due to their exceptional performance. However, their widespread adoption has raised concerns about security in the face of malicious attacks. Most existing methods rely on empirical adjustments during the training process, lacking a clear theoretical foundation. In this study, we address this gap by introducing SpecFormer, specifically designed to enhance ViTs' resilience against adversarial attacks, with support from carefully derived theoretical guarantees. We establish local Lipschitz bounds for the self-attention layer and introduce a novel approach, Maximum Singular Value Penalization (MSVP), to attain precise control over these bounds. We seamlessly integrate MSVP into ViTs' attention layers, using the power iteration method for enhanced computational efficiency. The modified model, SpecFormer, effectively reduces the spectral norms of attention weight matrices, thereby enhancing network local Lipschitzness. This, in turn, leads to improved training efficiency and robustness. Extensive experiments on CIFAR and ImageNet datasets confirm SpecFormer's superior performance in defending against adversarial attacks.

Submitted: Jan 2, 2024