Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications such as medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks. A minimal code sketch of the patch-as-token idea follows.
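The sketch below is a rough PyTorch illustration of the core ViT mechanism described above, not an implementation from any of the listed papers: the image is split into fixed-size patches, each patch is linearly embedded, a class token and positional embeddings are added, and the resulting sequence is processed by a standard transformer encoder. All names and hyperparameters (TinyViT, patch_size=16, dim=192, etc.) are illustrative assumptions.

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution is equivalent to flattening
        # each non-overlapping patch and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_chans, dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                    # self-attention over patch tokens
        return self.head(x[:, 0])              # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])

Techniques such as token pruning operate on the patch-token sequence produced here (dropping low-importance tokens between encoder layers), while hybrid models replace or augment the simple patch embedding with convolutional or state-space components.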
Papers
Fully Attentional Networks with Self-emerging Token Labeling
Bingyin Zhao, Zhiding Yu, Shiyi Lan, Yutao Cheng, Anima Anandkumar, Yingjie Lao, Jose M. Alvarez
LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition
Youbing Hu, Yun Cheng, Anqi Lu, Zhiqiang Cao, Dawei Wei, Jie Liu, Zhijun Li
Denoising Vision Transformers
Jiawei Yang, Katie Z Luo, Jiefeng Li, Kilian Q Weinberger, Yonglong Tian, Yue Wang
SPFormer: Enhancing Vision Transformer with Superpixel Representation
Jieru Mei, Liang-Chieh Chen, Alan Yuille, Cihang Xie
CrisisViT: A Robust Vision Transformer for Crisis Image Classification
Zijun Long, Richard McCreadie, Muhammad Imran
A Random Ensemble of Encrypted models for Enhancing Robustness against Adversarial Examples
Ryota Iijima, Sayaka Shiota, Hitoshi Kiya