Fast Vision Transformer

Fast Vision Transformers (FViTs) aim to overcome the computational limitations of standard Vision Transformers (ViTs) while maintaining high accuracy on computer vision tasks. Research focuses on efficient architecture design through techniques such as hierarchical attention, optimized attention mechanisms (e.g., learnable Gabor filters or cascaded group attention), and generative architecture search, yielding models such as FasterViT, TurboViT, and EfficientViT. These advances deliver faster inference and lower memory consumption, making ViTs practical for real-time applications in robotics, on mobile devices, and in other resource-constrained environments.
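To make the attention-level savings concrete, below is a rough NumPy sketch of the cascaded group attention idea used in EfficientViT: each head attends over only a slice of the channels (reducing per-head cost), and the previous head's output is fed into the next head's input slice. This is an illustrative simplification, not the paper's implementation; all function and weight names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, wq, wk, wv, num_heads):
    """Simplified cascaded group attention.

    x:  (seq_len, dim) token features.
    wq, wk, wv: per-head projection weights, each (num_heads, hd, hd)
                where hd = dim // num_heads.
    Each head sees only its own channel slice of x, plus the previous
    head's output (the "cascade"), so heads operate on reduced width.
    """
    n, d = x.shape
    hd = d // num_heads
    outs = []
    carry = np.zeros((n, hd))  # output of the previous head in the cascade
    for h in range(num_heads):
        # Channel slice for this head, refined by the previous head's output.
        xs = x[:, h * hd:(h + 1) * hd] + carry
        q, k, v = xs @ wq[h], xs @ wk[h], xs @ wv[h]
        attn = softmax(q @ k.T / np.sqrt(hd))
        carry = attn @ v
        outs.append(carry)
    # Concatenate head outputs back to the full channel dimension.
    return np.concatenate(outs, axis=-1)
```

Because each head works on dim/num_heads channels instead of the full width, the projection cost per head shrinks accordingly, which is one of the levers these architectures use to cut FLOPs relative to standard multi-head attention.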

Papers