Lightweight Vision Transformer

Lightweight Vision Transformers (ViTs) aim to reduce the computational cost and memory footprint of standard ViTs so that they can run on resource-constrained devices while maintaining competitive performance. Current research pursues this in two ways: architectural changes, such as latency-aware blocks that combine convolutions with sparse self-attention, and pre-training techniques, such as masked image modeling and knowledge distillation, that improve performance when training data is limited. These advances make it practical to deploy transformer-based models in mobile and edge computing applications, extending advanced computer vision capabilities to more devices.
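
As an illustration of the knowledge distillation mentioned above, the following is a minimal sketch of a commonly used distillation objective: a weighted sum of the usual cross-entropy on the true label and a temperature-softened KL divergence between teacher and student predictions. It is not taken from any specific lightweight ViT paper; the function names, the `temperature` and `alpha` defaults, and the example logits are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical helper: alpha * CE(student, label)
    + (1 - alpha) * T^2 * KL(teacher || student) at temperature T."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Soft-label term: KL divergence between softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)

    # The T^2 factor keeps the soft term's gradient scale comparable
    # to the hard term when the temperature changes.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


if __name__ == "__main__":
    # Illustrative logits for a 3-class problem (made-up values).
    teacher = [4.0, 1.0, -2.0]
    student = [1.5, 0.5, 0.0]
    print(distillation_loss(student, teacher, label=0))
```

When the student's logits match the teacher's, the KL term vanishes and only the weighted cross-entropy remains, which is one quick way to sanity-check an implementation of this kind.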

Papers