Mobile Vision Transformer

Mobile Vision Transformers (MobileViTs) adapt the Vision Transformer (ViT) architecture to resource-constrained mobile devices, where the compute and memory cost of standard ViTs is often too high. Current research focuses on lightweight MobileViT architectures built with aggressive low-bit quantization (e.g., 1-bit or ternary quantization), cheaper self-attention mechanisms (e.g., separable self-attention), and model designs tailored to specific mobile vision tasks such as depth estimation, object tracking, and instance segmentation. The aim is to make advanced vision capabilities deployable on the device itself, for applications ranging from augmented reality to autonomous systems.
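As a rough illustration of the ternary quantization mentioned above, the sketch below maps a list of float weights to codes in {-1, 0, +1} with a single shared scale. It assumes one common threshold heuristic (a threshold of 0.7 times the mean absolute weight, with the scale taken as the mean magnitude of the surviving weights); the function names `ternarize` and `dequantize` and the `delta_factor` parameter are hypothetical, and the schemes used in specific MobileViT papers may differ.

```python
from typing import List, Tuple


def ternarize(weights: List[float], delta_factor: float = 0.7) -> Tuple[List[int], float]:
    """Map float weights to codes in {-1, 0, +1} plus one shared scale.

    Assumption: threshold = delta_factor * mean(|w|), a common heuristic,
    not necessarily the rule used by any particular MobileViT variant.
    """
    if not weights:
        return [], 0.0

    # Threshold below which a weight is treated as zero.
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_factor * mean_abs

    # Keep only the sign of weights whose magnitude exceeds the threshold.
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]

    # Shared scale: mean magnitude of the weights that were not zeroed.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate float weights from ternary codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    codes, scale = ternarize([0.9, -0.8, 0.05, -0.02, 0.4])
    print(codes, scale)
    print(dequantize(codes, scale))
```

Each weight then needs only about two bits of storage plus one float per tensor, which is where the memory savings come from. In practice the scale is usually computed per layer or per channel, and the model is fine-tuned with the quantizer in the loop to recover accuracy.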

Papers