Plain Vision Transformer

Plain Vision Transformers (ViTs) are deep learning models that apply the transformer architecture directly to image data, favoring simplicity and generalizability over more complex, hierarchical designs. Current research focuses on improving their efficiency and performance on tasks such as semantic segmentation, change detection, and anomaly detection, often through techniques like adaptive token merging, dynamic token pruning, and novel decoder architectures. These efforts matter because they probe how far simple, scalable models can be pushed, which could yield more efficient and broadly applicable solutions in computer vision.
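To make the "apply the transformer directly to image data" idea concrete, here is a minimal NumPy sketch of a plain-ViT forward pass: the image is split into non-overlapping patches, each patch is linearly embedded into a token, positional embeddings are added, and the token sequence goes through self-attention. All sizes, weight shapes, and names below are illustrative assumptions, not taken from any specific paper; a real model would use multiple multi-head blocks with MLPs, layer norm, and learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)           # (num_patches, p*p*C)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a token sequence x of shape (N, D)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])         # scaled dot-product
    return softmax(scores) @ v

# Toy sizes (assumptions): 32x32 RGB image, 8x8 patches -> 16 tokens, dim 64.
img = rng.standard_normal((32, 32, 3))
p, d = 8, 64
tokens = patchify(img, p)                           # (16, 192)
W_embed = rng.standard_normal((p * p * 3, d)) * 0.02
x = tokens @ W_embed                                # linear patch embedding
x = x + rng.standard_normal((x.shape[0], d)) * 0.02 # stand-in for learned pos. embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)                                    # (16, 64)
```

Because every stage operates on a flat sequence of patch tokens at a single resolution, techniques like token merging or pruning simply shorten this sequence mid-network, which is where much of the efficiency work on plain ViTs concentrates.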

Papers