Bilateral Local Attention Vision Transformer

Bilateral Local Attention Vision Transformers (BLATs) aim to improve the efficiency and effectiveness of Vision Transformers (ViTs) by strategically limiting the scope of the attention mechanism. Current research focuses on architectures that apply local attention in both image space (e.g., over sliding windows) and feature space (e.g., over clusters of similar features), capturing short-range and long-range dependencies more efficiently than global attention, whose cost grows quadratically with the number of tokens. This approach improves performance on computer vision tasks such as video frame interpolation, object segmentation, and moment retrieval, while reducing the computational cost of processing large image inputs. The resulting models offer a compelling alternative to traditional convolutional neural networks and are shaping the development of more efficient and powerful vision systems.
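To make the two-branch idea concrete, below is a minimal PyTorch sketch of a bilateral local attention layer. It is illustrative only, not a specific published model: the class name `BilateralLocalAttention`, the `window` and `num_clusters` parameters, and the learned centroid matrix are all assumptions. The spatial branch runs self-attention within non-overlapping windows (a stand-in for the sliding-window schemes mentioned above), and the feature branch softly assigns tokens to learned cluster prototypes and lets every token attend to the resulting cluster summaries, approximating attention among similar features regardless of spatial distance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BilateralLocalAttention(nn.Module):
    """Illustrative sketch: a spatial branch attends within local windows
    (short-range), and a feature branch attends over cluster summaries of
    similar features (long-range). Names and details are assumptions."""

    def __init__(self, dim, num_heads=4, window=7, num_clusters=8):
        super().__init__()
        self.window = window
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cluster_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned cluster prototypes for the feature-space branch (an assumption;
        # actual BLAT papers may cluster features differently).
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) feature map; H and W assumed divisible by the window size.
        B, H, W, C = x.shape
        w = self.window

        # --- Spatial branch: self-attention inside non-overlapping w x w windows ---
        wins = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        wins = wins.reshape(-1, w * w, C)                       # (B * num_windows, w*w, C)
        local, _ = self.spatial_attn(wins, wins, wins)
        local = local.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H, W, C)

        # --- Feature branch: soft-cluster tokens, then attend to cluster summaries ---
        tokens = x.reshape(B, H * W, C)
        assign = F.softmax(tokens @ self.centroids.t(), dim=-1)  # (B, N, K) soft assignment
        clusters = assign.transpose(1, 2) @ tokens               # (B, K, C) cluster summaries
        clusters = clusters / (assign.sum(dim=1).unsqueeze(-1) + 1e-6)
        far, _ = self.cluster_attn(tokens, clusters, clusters)   # tokens query the K summaries
        far = far.view(B, H, W, C)

        # Fuse the short-range (window) and long-range (cluster) paths.
        return self.proj(torch.cat([local, far], dim=-1))


# Usage: a 14x14 feature map with 64 channels and 7x7 windows.
layer = BilateralLocalAttention(dim=64, window=7)
out = layer(torch.randn(2, 14, 14, 64))  # -> (2, 14, 14, 64)
```

The efficiency gain in this sketch comes from the branch sizes: window attention costs O(N * w^2) rather than O(N^2), and the cluster branch adds only O(N * K) for K cluster summaries, so neither branch pays the quadratic cost of full global attention.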

Papers