Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks (CNNs) or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional CNNs on specific tasks.
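To make the "images as sequences of patches" idea concrete, below is a minimal PyTorch sketch of the ViT front end: an image is cut into non-overlapping patches, each patch is linearly embedded into a token, and the resulting token sequence (with a class token and positional embeddings) is fed to a standard transformer encoder. The class names and hyperparameters (16x16 patches, 768-dim embeddings, 12 heads) follow the common ViT-Base configuration but are illustrative, not a reproduction of any specific paper's code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one.

    A Conv2d with kernel_size == stride == patch_size is equivalent to
    slicing the image into non-overlapping patches and applying a shared
    linear projection to each flattened patch.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D): a sequence of patch tokens

# Embed a batch of images, prepend a learnable [CLS] token, add positional
# embeddings, and run a standard transformer encoder over the token sequence.
embed = PatchEmbed()
x = torch.randn(2, 3, 224, 224)
tokens = embed(x)                                            # (2, 196, 768)
cls_token = nn.Parameter(torch.zeros(1, 1, 768))
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1)  # (2, 197, 768)
pos_embed = nn.Parameter(torch.zeros(1, tokens.size(1), 768))
tokens = tokens + pos_embed

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
out = encoder(tokens)  # per-token features; out[:, 0] is the [CLS] representation
```

In practice the `[CLS]` output feeds a classification head, while dense tasks such as segmentation or detection consume the full patch-token grid; much of the efficiency work listed below (token pruning, token merging) operates on this token sequence directly.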
Papers
Class-Discriminative Attention Maps for Vision Transformers
Lennart Brocki, Jakub Binda, Neo Christopher Chung
DiffiT: Diffusion Vision Transformers for Image Generation
Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat
SRTransGAN: Image Super-Resolution using Transformer based Generative Adversarial Network
Neeraj Baghel, Shiv Ram Dubey, Satish Kumar Singh
Adapting Short-Term Transformers for Action Detection in Untrimmed Videos
Min Yang, Huan Gao, Ping Guo, Limin Wang
MobileUtr: Revisiting the relationship between light-weight CNN and Transformer for efficient medical image segmentation
Fenghe Tang, Bingkun Nian, Jianrui Ding, Quan Quan, Jie Yang, Wei Liu, S. Kevin Zhou
A Comprehensive Study of Vision Transformers in Image Classification Tasks
Mahmoud Khalil, Ahmad Khalil, Alioune Ngom
USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery
Jeremy Irvin, Lucas Tao, Joanne Zhou, Yuntao Ma, Langston Nashold, Benjamin Liu, Andrew Y. Ng
Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning
Utku Mert Topcuoglu, Erdem Akagündüz
Token Fusion: Bridging the Gap between Token Pruning and Token Merging
Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin
Improve Supervised Representation Learning with Masked Image Modeling
Kaifeng Chen, Daniel Salz, Huiwen Chang, Kihyuk Sohn, Dilip Krishnan, Mojtaba Seyedhosseini
Generative Parameter-Efficient Fine-Tuning
Chinmay Savadikar, Xi Song, Tianfu Wu
A Recent Survey of Vision Transformers for Medical Image Segmentation
Asifullah Khan, Zunaira Rauf, Abdul Rehman Khan, Saima Rathore, Saddam Hussain Khan, Najmus Saher Shah, Umair Farooq, Hifsa Asif, Aqsa Asif, Umme Zahoora, Rafi Ullah Khalil, Suleman Qamar, Umme Hani Asif, Faiza Babar Khan, Abdul Majid, Jeonghwan Gwak
Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach
Yuxin Li, Qiang Han, Mengying Yu, Yuxin Jiang, Chaikiat Yeo, Yiheng Li, Zihang Huang, Nini Liu, Hsuanhan Chen, Xiaojun Wu
SCHEME: Scalable Channel Mixer for Vision Transformers
Deepak Sridhar, Yunsheng Li, Nuno Vasconcelos
Improving Interpretation Faithfulness for Vision Transformers
Lijie Hu, Yixin Liu, Ninghao Liu, Mengdi Huai, Lichao Sun, Di Wang
GeoDeformer: Geometric Deformable Transformer for Action Recognition
Jinhui Ye, Jiaming Zhou, Hui Xiong, Junwei Liang
PViT-6D: Overclocking Vision Transformers for 6D Pose Estimation with Confidence-Level Prediction and Pose Tokens
Sebastian Stapf, Tobias Bauernfeind, Marco Riboldi