Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating images as sequences of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications such as medical image analysis, object detection, and spatiotemporal prediction, where ViT-based models can offer better accuracy and efficiency than purely convolutional networks on specific tasks.
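To make the "images as sequences of patches" idea concrete, the sketch below shows a minimal ViT-style classifier in PyTorch. It is illustrative only: the module name TinyViT and its hyperparameters are hypothetical, and real ViTs differ in details such as normalization placement and training recipe.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal sketch of a Vision Transformer (hypothetical, for illustration)."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution cuts the image into
        # non-overlapping patches and projects each patch to `dim` features.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.patch_embed(x)              # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, dim): the sequence of patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                  # standard transformer encoder over the patch sequence
        return self.head(x[:, 0])            # classify from the [CLS] token

if __name__ == "__main__":
    logits = TinyViT()(torch.randn(2, 3, 224, 224))
    print(logits.shape)                      # torch.Size([2, 1000])
```

The key contrast with convolutional networks is that, after the patch-embedding step, all interactions between image regions happen through self-attention over the token sequence rather than through local convolutions.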
1,550 papers
Papers
September 20, 2024
ViTGuard: Attention-aware Detection against Adversarial Examples for Vision Transformer
OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition
Boosting Federated Domain Generalization: The Role of Advanced Pre-Trained Architectures
DS2TA: Denoising Spiking Transformer with Attenuated Spatiotemporal Attention