Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention redesign, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba-style state space models). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks.
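To make the patch-sequence view concrete, below is a minimal PyTorch sketch of the standard ViT patch-embedding step. The class name PatchEmbed and the default sizes (224-pixel images, 16-pixel patches, 768-dimensional embeddings, as in ViT-Base) are illustrative assumptions, not code from any of the papers listed here.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly project
    each patch to an embedding, yielding a token sequence for a transformer."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the usual trick: each kernel application
        # covers exactly one patch, so each output position is one patch token.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Token pruning, one of the efficiency techniques mentioned above, can be sketched as keeping only the highest-scoring tokens; because self-attention cost grows quadratically with sequence length, dropping half the tokens roughly quarters the attention cost. The scoring signal (e.g., attention from the [CLS] token) and the keep_ratio parameter below are assumptions for illustration, not a specific paper's method.

```python
def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of tokens.

    tokens: (B, N, D) patch embeddings; scores: (B, N) importance scores,
    e.g., the [CLS] token's attention weights over the patch tokens.
    """
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                       # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])  # (B, k, D)
    return tokens.gather(1, idx)                              # (B, k, D)
```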
Papers
vHeat: Building Vision Models upon Heat Conduction
Zhaozhi Wang, Yue Liu, Yunfan Liu, Hongtian Yu, Yaowei Wang, Qixiang Ye, Yunjie Tian
Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers
Chau Pham, Bryan A. Plummer
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
Neha Kalibhat, Priyatham Kattakinda, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi
Mamba-R: Vision Mamba ALSO Needs Registers
Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie
Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference
Ting Liu, Xuyang Liu, Siteng Huang, Liangtao Shi, Zunnan Xu, Yi Xin, Quanjun Yin, Xiaohong Liu
Scalable Visual State Space Model with Fractal Scanning
Lv Tang, HaoKe Xiao, Peng-Tao Jiang, Hao Zhang, Jinwei Chen, Bo Li
Efficient Visual State Space Model for Image Deblurring
Lingshun Kong, Jiangxin Dong, Ming-Hsuan Yang, Jinshan Pan
Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model
Yuheng Shi, Minjing Dong, Chang Xu
Configuring Data Augmentations to Reduce Variance Shift in Positional Embedding of Vision Transformers
Bum Jun Kim, Sang Woo Kim
LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
Anthony Fuller, Daniel G. Kyrollos, Yousef Yassin, James R. Green
DCT-Based Decorrelated Attention for Vision Transformers
Hongyi Pan, Emadeldeen Hamdan, Xin Zhu, Koushik Biswas, Ahmet Enis Cetin, Ulas Bagci
Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He
Vision Transformer with Sparse Scan Prior
Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He