Plain Vision Transformer
Plain Vision Transformers (ViTs) are deep learning models that apply the transformer architecture directly to image data, favoring simplicity and generalizability over more complex, hierarchical designs. Current research focuses on improving their efficiency and performance on tasks such as semantic segmentation, change detection, and anomaly detection, often through techniques like adaptive token merging, dynamic token pruning, and novel decoder architectures. These efforts matter because they test how far simple, scalable backbones can be pushed, pointing toward more efficient and broadly applicable solutions in computer vision.
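To make the "plain" (non-hierarchical) design and the token-pruning idea concrete, here is a minimal PyTorch sketch of a single-resolution ViT encoder with an optional score-based pruning step between blocks. The class name, hyperparameters, and the CLS-similarity pruning heuristic are illustrative assumptions, not the method of any particular paper listed here.

```python
# Minimal sketch of a plain (single-resolution) ViT with optional token pruning.
# Sizes, names, and the pruning heuristic are illustrative assumptions.
import torch
import torch.nn as nn


class PlainViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=384, depth=6,
                 heads=6, num_classes=1000, keep_ratio=1.0):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: a strided conv turns each patch into one token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Plain design: identical blocks at one resolution, no stage downsampling.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)
        self.keep_ratio = keep_ratio  # fraction of patch tokens kept when pruning

    def prune_tokens(self, x):
        # Illustrative dynamic pruning: keep the patch tokens most aligned
        # with the [CLS] token (a stand-in for a learned importance score).
        cls, patches = x[:, :1], x[:, 1:]
        scores = (patches * cls).sum(-1)                        # (B, N)
        k = max(1, int(patches.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                     # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        return torch.cat([cls, patches.gather(1, idx)], dim=1)

    def forward(self, imgs):
        x = self.patch_embed(imgs).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if self.keep_ratio < 1.0 and i == len(self.blocks) // 2:
                x = self.prune_tokens(x)                        # drop low-score tokens once
        return self.head(self.norm(x)[:, 0])                    # classify from [CLS]


logits = PlainViT(keep_ratio=0.5)(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

Because every block operates on the same flat token sequence, dropping or merging tokens mid-network requires no changes to the architecture itself, which is one reason efficiency techniques of this kind pair naturally with plain ViTs.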