Novel Vision Transformer

Novel Vision Transformers (ViTs) aim to overcome the limitations of traditional ViTs, chiefly their reliance on fixed-size patch partitioning, which fragments image content without regard to its structure. Current research focuses on architectures that adapt to image content, such as using superpixels or learned patterns as input tokens, and on incorporating convolutional layers to capture local information alongside global dependencies. These advances improve performance on a range of computer vision tasks, including image classification, object detection, and semantic segmentation, and offer better interpretability and efficiency than earlier models.
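To make the fixed-size partitioning limitation concrete, the sketch below shows the standard ViT tokenization step: an image is cut into a rigid grid of non-overlapping patches, each flattened into one token, regardless of what the image contains. This is a minimal NumPy illustration (the function name `patchify` and the shapes are chosen for exposition, not taken from any specific paper); the adaptive approaches described above replace exactly this step with content-dependent tokens such as superpixels.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping fixed-size patches.

    Standard ViT tokenization: every patch becomes one input token,
    with boundaries placed on a rigid grid that ignores image content.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must tile evenly"
    ph, pw = H // patch_size, W // patch_size
    # Reshape into a (ph, pw) grid of (patch_size, patch_size, C) blocks,
    # then flatten each block into a single token vector.
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(ph * pw, patch_size * patch_size * C)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of length 768,
# the configuration used by the original ViT-Base model.
img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768)
```

Note that the grid is fixed before the image is ever inspected: an object straddling a patch boundary is split across tokens, which is the context disruption that superpixel-based and learned tokenizations aim to avoid.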

Papers