Paper ID: 2410.22709

FilterViT and DropoutViT: Lightweight Vision Transformer Models for Efficient Attention Mechanisms

Bohang Sun (School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China)

In this study, we introduce FilterViT, an enhanced version of MobileViT that applies an attention-based mechanism during early-stage downsampling. Traditional QKV operations on high-resolution feature maps are computationally intensive because of the large number of tokens involved. To address this, we propose a filter attention mechanism that uses a convolutional neural network (CNN) to generate an importance mask and restricts attention to the most salient image regions. This approach significantly reduces computational complexity while preserving interpretability, since the mask highlights the image areas that matter most. Experimental results show that FilterViT achieves substantial gains in both efficiency and accuracy compared with other models. We also introduce DropoutViT, a variant that selects pixels stochastically, further improving robustness.

Submitted: Oct 30, 2024
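
To make the filter attention idea concrete, the following is a minimal PyTorch sketch of one possible realization: a small CNN predicts a per-pixel importance mask, only the top-k most important positions are passed through multi-head attention, and the attended tokens are written back into the feature map. The module name, the top-k selection rule, and the keep ratio are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class FilterAttention(nn.Module):
    """Illustrative filter attention block (assumed design, not the paper's exact method):
    a lightweight CNN scores pixel importance, and attention runs only on the top-k pixels."""

    def __init__(self, dim: int, num_heads: int = 4, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight CNN producing a single-channel importance map in [0, 1].
        self.mask_cnn = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise conv
            nn.Conv2d(dim, 1, 1),                           # pointwise projection to 1 channel
            nn.Sigmoid(),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        scores = self.mask_cnn(x).flatten(1)                # (B, H*W) importance scores
        k = max(1, int(self.keep_ratio * H * W))
        idx = scores.topk(k, dim=1).indices                 # indices of the k most important pixels
        tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        selected = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C)
        )                                                    # (B, k, C) selected tokens
        attended, _ = self.attn(selected, selected, selected)
        # Write attended tokens back; unselected positions keep their original values.
        out = tokens.scatter(1, idx.unsqueeze(-1).expand(-1, -1, C), attended)
        return out.transpose(1, 2).reshape(B, C, H, W)

# Example usage: output has the same shape as the input feature map.
block = FilterAttention(dim=64)
y = block(torch.randn(2, 64, 32, 32))
```

With a keep ratio of 0.25, attention here operates on roughly a quarter of the spatial positions, which is the source of the complexity reduction the abstract describes; the exact selection strategy and ratio in FilterViT may differ.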