Efficient Vision Transformer

Efficient Vision Transformers (ViTs) aim to overcome the computational limitations of standard ViTs while preserving their strong performance on computer vision tasks. Current research focuses on novel attention mechanisms (e.g., polynomial attention), token reduction strategies (e.g., learnable token merging, dynamic token idling), and adaptive computation techniques that adjust the number of tokens processed to the complexity of each image. These advances matter because they enable the deployment of ViTs in resource-constrained environments such as mobile devices and embedded systems, broadening their applicability across fields.
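To make the token-merging idea concrete, here is a minimal NumPy sketch of bipartite soft matching in the spirit of ToMe-style token merging. The function name `bipartite_merge` and the unweighted averaging are illustrative simplifications (practical implementations typically track merged-token sizes and use weighted means, and operate on batched attention keys inside the transformer block); this is a sketch under those assumptions, not a reference implementation.

```python
import numpy as np

def bipartite_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Reduce an (n, d) token matrix by merging the r most redundant tokens.

    Sketch of bipartite matching: split tokens alternately into sets A and B,
    find each A token's most similar B token by cosine similarity, then fold
    the r best-matched A tokens into their B partners by (unweighted) averaging.
    """
    a, b = tokens[0::2], tokens[1::2].copy()
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                      # cosine similarity, shape (|A|, |B|)
    best = sim.argmax(axis=1)            # each A token's closest B token
    score = sim.max(axis=1)              # how redundant each A token is
    order = np.argsort(-score)           # most redundant A tokens first
    merge_idx, keep_idx = order[:r], order[r:]
    for i in merge_idx:                  # fold merged A tokens into B
        j = best[i]
        b[j] = (b[j] + a[i]) / 2         # simplification: unweighted average
    return np.concatenate([a[keep_idx], b], axis=0)
```

Because the merge is a fixed-size reduction per layer (remove exactly `r` tokens), the resulting sequence lengths stay static across a batch, which keeps the model easy to run on accelerators; schemes like dynamic token idling instead keep idled tokens in the sequence but skip computing on them.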

Papers