Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications such as medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional networks on specific tasks.
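To make the "images as sequences of patches" idea concrete, here is a minimal sketch in PyTorch of the patch-tokenization step shared by most ViT variants; it is an illustrative example, not code from any of the papers listed below, and names like PatchEmbed are placeholders.

```python
# Minimal ViT patch embedding: split the image into fixed-size patches,
# project each patch to an embedding, prepend a learnable [CLS] token,
# and add positional embeddings to form the transformer's input sequence.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution performs "split into patches + linear projection" in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                  # x: (B, C, H, W)
        x = self.proj(x)                   # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)   # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)     # prepend [CLS] token
        return x + self.pos_embed          # add positional information

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]) -> 196 patch tokens + 1 [CLS]
```

Efficiency-oriented methods such as token pruning operate on this token sequence, dropping low-importance patch tokens between transformer blocks to reduce the quadratic attention cost.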
Papers
S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens
Rizhao Cai, Zitong Yu, Chenqi Kong, Haoliang Li, Changsheng Chen, Yongjian Hu, Alex Kot
DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions
Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tong Wang, Zhaoxiang Zhang
Compressing Vision Transformers for Low-Resource Visual Learning
Eric Youn, Sai Mitheran J, Sanjana Prabhu, Siyuan Chen
Domain Adaptation for Efficiently Fine-tuning Vision Transformer with Encrypted Images
Teru Nagamori, Sayaka Shiota, Hitoshi Kiya
A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking
Lorenzo Papa, Paolo Russo, Irene Amerini, Luping Zhou
Mask-Attention-Free Transformer for 3D Instance Segmentation
Xin Lai, Yuhui Yuan, Ruihang Chu, Yukang Chen, Han Hu, Jiaya Jia
Locality-Aware Hyperspectral Classification
Fangqin Zhou, Mert Kilickaya, Joaquin Vanschoren
Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization
Yiwen Cao, Yukun Su, Wenjun Wang, Yanxia Liu, Qingyao Wu
ExMobileViT: Lightweight Classifier Extension for Mobile Vision Transformer
Gyeongdong Yang, Yungwook Kwon, Hyunjin Kim