Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating images as sequences of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks.
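The core idea of treating an image as a sequence of patch tokens can be sketched in a few lines. The snippet below is a minimal illustration (not any specific paper's implementation): it splits an image into non-overlapping patches, flattens each patch, and applies a linear projection to the model dimension; the patch size of 16 and the embedding dimension are illustrative choices, and the class token and position embeddings a full ViT would add are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    # Reshape into a grid of patches, then flatten each patch into a token.
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(ph * pw, patch_size * patch_size * C)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768.
image = np.random.rand(224, 224, 3)
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768)

# Each token is then linearly projected to the transformer's model dimension
# (a learnable class token and position embeddings would be added here).
embed_dim = 192  # illustrative model dimension
W_proj = np.random.randn(tokens.shape[1], embed_dim) * 0.02
embedded = tokens @ W_proj
print(embedded.shape)  # (196, 192)
```

The resulting sequence of embedded tokens is what the transformer encoder consumes, exactly as it would consume word embeddings in NLP.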
Papers
GSB: Group Superposition Binarization for Vision Transformer with Limited Training Samples
Tian Gao, Cheng-Zhong Xu, Le Zhang, Hui Kong
M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer
Yunsheng Ma, Liangqi Yuan, Amr Abdelraouf, Kyungtae Han, Rohit Gupta, Zihao Li, Ziran Wang
Salient Mask-Guided Vision Transformer for Fine-Grained Classification
Dmitry Demidov, Muhammad Hamza Sharif, Aliakbar Abdurahimov, Hisham Cholakkal, Fahad Shahbaz Khan
EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Dahun Kim, Anelia Angelova, Weicheng Kuo
Patch-wise Mixed-Precision Quantization of Vision Transformer
Junrui Xiao, Zhikai Li, Lianwei Yang, Qingyi Gu
Joint Moment Retrieval and Highlight Detection Via Natural Language Queries
Richard Luo, Austin Peng, Heidi Yap, Koby Beard
BiRT: Bio-inspired Replay in Vision Transformers for Continual Learning
Kishaan Jeeveswaran, Prashant Bhat, Bahram Zonooz, Elahe Arani
Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields
Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim
Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting
Zhicheng Wang, Liwen Xiao, Zhiguo Cao, Hao Lu
FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing
Ajian Liu, Zichang Tan, Zitong Yu, Chenxu Zhao, Jun Wan, Yanyan Liang, Zhen Lei, Du Zhang, Stan Z. Li, Guodong Guo
Semantic Segmentation using Vision Transformers: A survey
Hans Thisanke, Chamli Deshan, Kavindu Chamith, Sachith Seneviratne, Rajith Vidanaarachchi, Damayanthi Herath
Reduction of Class Activation Uncertainty with Background Information
H M Dipu Kabir