Non-Hierarchical Vision Transformer

Non-hierarchical Vision Transformers (ViTs), often called plain ViTs, keep a single-scale feature map through every layer instead of the multi-stage, progressively downsampled design of hierarchical models. The aim is to reach high performance with a simpler architecture. Current research focuses on adapting plain ViTs to tasks such as semantic segmentation, object detection, and pose estimation, usually with minimal modifications such as a simple feature pyramid or a lightweight decoder. This streamlined approach offers advantages in efficiency and transferability and may lead to faster, more adaptable vision systems; plain ViTs have shown competitive performance on several benchmark datasets.
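
To make the idea of a "simple feature pyramid" concrete, the sketch below computes only the tensor shapes involved. It does not implement any particular paper's model. The names `FeatureMap`, `plain_vit_feature_map`, and `simple_feature_pyramid` are hypothetical. The patch size of 16 and embedding width of 768 follow the common ViT-B/16 configuration, and the scale factors and the 256 output channels are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMap:
    stride: int    # downsampling factor relative to the input image
    height: int
    width: int
    channels: int


def plain_vit_feature_map(image_h, image_w, patch_size=16, embed_dim=768):
    """Shape of the single-scale map a plain ViT keeps through every block."""
    if image_h % patch_size or image_w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return FeatureMap(patch_size, image_h // patch_size,
                      image_w // patch_size, embed_dim)


def simple_feature_pyramid(base, scales=(4.0, 2.0, 1.0, 0.5), out_channels=256):
    """Shapes obtained by resampling only the final map by each scale factor.

    A scale above 1 stands for upsampling (for example a transposed
    convolution); a scale below 1 stands for downsampling (for example
    strided pooling). Both choices are assumptions of this sketch.
    """
    levels = []
    for s in scales:
        levels.append(FeatureMap(
            stride=int(base.stride / s),
            height=int(base.height * s),
            width=int(base.width * s),
            channels=out_channels,
        ))
    return levels


if __name__ == "__main__":
    base = plain_vit_feature_map(224, 224)
    print(base)
    for level in simple_feature_pyramid(base):
        print(level)
```

Under these assumptions, a 224×224 input yields one 14×14 map at stride 16, and the pyramid levels are all derived from that final map rather than from intermediate backbone stages, which is the main contrast with a hierarchical backbone.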

Papers