Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks.
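The patch-as-token idea behind ViTs is compact enough to sketch directly. The snippet below is a minimal, illustrative PyTorch sketch, not an implementation from any paper listed here: it splits an image into fixed-size patches via a strided convolution (equivalent to a shared linear projection of flattened patches), prepends a class token, adds positional embeddings, and runs the token sequence through a standard transformer encoder. The class name (TinyViT) and hyperparameters (patch_size=16, dim=192, depth=4) are illustrative assumptions.

```python
# Minimal ViT sketch: image -> patch tokens -> transformer encoder -> class logits.
# Illustrative only; hyperparameters are arbitrary small values, not from any cited paper.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution is equivalent to flattening
        # each patch and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_chans, dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # x: (B, C, H, W) -> patch tokens of shape (B, N, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        # Classify from the class token's final representation.
        return self.head(tokens[:, 0])


if __name__ == "__main__":
    model = TinyViT()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1000])
```

Many of the papers below modify exactly these components: token pruning and block bypassing act on the token sequence, quantization and adaptors act on the encoder layers, and hybrid designs replace or augment the patch embedding with convolutional stages.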
Papers
ChangeViT: Unleashing Plain Vision Transformers for Change Detection
Duowang Zhu, Xiaohu Huang, Haiyan Huang, Zhenfeng Shao, Qimin Cheng
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models
Hengyi Wang, Shiwei Tan, Hao Wang
Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers
Chaitanya Devaguptapu, Sumukh Aithal, Shrinivas Ramasubramanian, Moyuru Yamada, Manohar Kaul
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen
MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction
Lianwei Yang, Zhikai Li, Junrui Xiao, Haisong Gong, Qingyi Gu
Vision Transformer Segmentation for Visual Bird Sound Denoising
Sahil Kumar, Jialu Li, Youshan Zhang
Fusion of regional and sparse attention in Vision Transformers
Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara
OT-VP: Optimal Transport-guided Visual Prompting for Test-Time Adaptation
Yunbei Zhang, Akshay Mehra, Jihun Hamm
AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer
Yitao Xu, Tong Zhang, Sabine Süsstrunk
Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking
Xiangyang Yang, Dan Zeng, Xucheng Wang, You Wu, Hengzhou Ye, Qijun Zhao, Shuiwang Li
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
Shimin Chen, Yitian Yuan, Shaoxiang Chen, Zequn Jie, Lin Ma
A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis
Leonardo Scabini, Andre Sacilotti, Kallil M. Zielinski, Lucas C. Ribas, Bernard De Baets, Odemir M. Bruno
Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control
Dongyoon Hwang, Byungkun Lee, Hojoon Lee, Hyunseung Kim, Jaegul Choo