Local Self-Attention

Local self-attention mechanisms capture contextual information within limited spatial or temporal neighborhoods, avoiding the quadratic cost that global self-attention incurs in the number of tokens: each token attends only to a fixed-size window rather than to the full input. Current research focuses on improving the efficiency and effectiveness of local self-attention across architectures, notably vision transformers (e.g., the Swin Transformer, its variants, and other hierarchical ViTs), as well as models for image processing, video analysis, and point cloud understanding. These advances matter because they make transformer-based models practical for high-resolution data and large-scale tasks where global self-attention is computationally prohibitive, improving performance in applications such as image classification, object detection, and video understanding.
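
To make the efficiency argument concrete, below is a minimal, single-head sketch of windowed self-attention in NumPy, assuming non-overlapping windows over a 1-D token sequence. Function and parameter names are illustrative, and it omits the shifted windows, relative position bias, and multi-head projections used by Swin-style models; it only shows how restricting attention to windows of size w reduces the cost from O(n^2 d) to O(n w d).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, w_q, w_k, w_v, window_size):
    """Non-overlapping window self-attention over a 1-D sequence.

    x: (seq_len, d_model); seq_len must be divisible by window_size.
    Attention is computed independently inside each window, so the score
    matrices are (window_size x window_size) instead of (seq_len x seq_len).
    """
    seq_len, d_model = x.shape
    assert seq_len % window_size == 0, "pad the sequence to a window multiple"
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Reshape into (num_windows, window_size, d) so each window attends locally.
    q = q.reshape(-1, window_size, d_model)
    k = k.reshape(-1, window_size, d_model)
    v = v.reshape(-1, window_size, d_model)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_model)  # (nw, w, w)
    attn = softmax(scores, axis=-1)
    out = attn @ v                                        # (nw, w, d)
    return out.reshape(seq_len, d_model)

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((64, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
y = window_self_attention(x, w_q, w_k, w_v, window_size=8)
print(y.shape)  # (64, 16)
```

Because windows do not overlap, tokens in different windows never exchange information in a single layer; hierarchical designs such as the Swin Transformer recover cross-window context by shifting the window grid between consecutive layers and by downsampling between stages.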

Papers