Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating images as sequences of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on certain tasks.
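The core idea of treating an image as a sequence of patches can be sketched in a few lines. The snippet below is a minimal, illustrative NumPy version of the patch-embedding step (the function name `patchify`, the 224×224 input size, and the embedding dimension are assumptions chosen for illustration, not taken from any specific paper above):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. the token sequence a ViT encoder operates on.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    # Reshape into a grid of patches, then flatten each patch into a vector.
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(ph * pw, patch_size * patch_size * C)

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 tokens,
# each of dimension 16*16*3 = 768.
image = np.zeros((224, 224, 3))
tokens = patchify(image, 16)

# Each token is then linearly projected to the model dimension; in a real
# ViT the projection is learned and position embeddings are added before
# the transformer encoder.
embed_dim = 192  # illustrative; ViT-Base uses 768
rng = np.random.default_rng(0)
W_proj = rng.normal(0.0, 0.02, size=(tokens.shape[1], embed_dim))
embedded = tokens @ W_proj  # shape (196, 192)
```

After this step, the resulting token sequence is processed by standard transformer blocks (multi-head self-attention plus MLPs), which is what the efficiency techniques above, such as token pruning and token merging, operate on.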
Papers
Is Attentional Channel Processing Design Required? Comprehensive Analysis Of Robustness Between Vision Transformers And Fully Attentional Networks
Abhishri Ajit Medewar, Swanand Ashokrao Kavitkar
Improving Visual Prompt Tuning for Self-supervised Vision Transformers
Seungryong Yoo, Eunji Kim, Dahuin Jung, Jungbeom Lee, Sungroh Yoon
Multi-Scale And Token Mergence: Make Your ViT More Efficient
Zhe Bian, Zhe Wang, Wenqiang Han, Kangping Wang
Revising deep learning methods in parking lot occupancy detection
Anastasia Martynova, Mikhail Kuznetsov, Vadim Porvatov, Vladislav Tishin, Andrey Kuznetsov, Natalia Semenova, Ksenia Kuznetsova
Efficient Vision Transformer for Human Pose Estimation via Patch Selection
Kaleab A. Kinfu, Rene Vidal
TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for Medical Image Segmentation
Rui Sun, Tao Lei, Weichuan Zhang, Yong Wan, Yong Xia, Asoke K. Nandi
Human-imperceptible, Machine-recognizable Images
Fusheng Hao, Fengxiang He, Yikai Wang, Fuxiang Wu, Jing Zhang, Jun Cheng, Dacheng Tao
CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation
Tao Lei, Rui Sun, Xuan Wang, Yingbo Wang, Xi He, Asoke Nandi
Centered Self-Attention Layers
Ameen Ali, Tomer Galanti, Lior Wolf
A Novel Vision Transformer with Residual in Self-attention for Biomedical Image Classification
Arun K. Sharma, Nishchal K. Verma
nnMobileNet: Rethinking CNN for Retinopathy Research
Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xin Li, Natasha Lepore, Oana M. Dumitrascu, Yalin Wang
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer
Lightweight Vision Transformer with Bidirectional Interaction
Qihang Fan, Huaibo Huang, Xiaoqiang Zhou, Ran He
Vision Transformers for Mobile Applications: A Short Survey
Nahid Alam, Steven Kolawole, Simardeep Sethi, Nishant Bansali, Karina Nguyen
Prompt-Based Tuning of Transformer Models for Multi-Center Medical Image Segmentation of Head and Neck Cancer
Numan Saeed, Muhammad Ridzuan, Roba Al Majzoub, Mohammad Yaqub
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts
Rishov Sarkar, Hanxue Liang, Zhiwen Fan, Zhangyang Wang, Cong Hao