Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks (CNNs) or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional CNNs on specific tasks.
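To make the core idea concrete, here is a minimal sketch (assuming PyTorch) of how a ViT turns an image into a patch sequence and processes it with a standard transformer encoder. The class name, hyperparameters, and layer choices below are illustrative, not taken from any of the papers listed here.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Illustrative ViT: patchify -> embed -> transformer encoder -> classify."""
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting the image into
        # non-overlapping patches and applying one shared linear projection.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.patch_embed(x)              # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                  # self-attention over patch tokens
        return self.head(x[:, 0])            # classify from the CLS token

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```

The efficiency work surveyed above (e.g., token pruning and recycling) typically intervenes on the patch-token sequence inside this loop, dropping or reusing tokens between encoder layers.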
Papers
STR-Cert: Robustness Certification for Deep Text Recognition on Deep Learning Pipelines and Vision Transformers
Daqian Shao, Lukas Fesser, Marta Kwiatkowska
TransNeXt: Robust Foveal Visual Perception for Vision Transformers
Dai Shi
Large Language Models Meet Computer Vision: A Brief Survey
Raby Hamadi
Spiking Neural Networks with Dynamic Time Steps for Vision Transformers
Gourav Datta, Zeyu Liu, Anni Li, Peter A. Beerel
Typhoon Intensity Prediction with Vision Transformer
Huanxin Chen, Pengshuai Yin, Huichou Huang, Qingyao Wu, Ruirui Liu, Xiatian Zhu
Spectro-ViT: A Vision Transformer Model for GABA-edited MRS Reconstruction Using Spectrograms
Gabriel Dias, Rodrigo Pommot Berto, Mateus Oliveira, Lucas Ueda, Sergio Dertkigil, Paula D. P. Costa, Amirmohammad Shamaei, Roberto Souza, Ashley Harris, Leticia Rittner
TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration
Jan Olszewski, Dawid Rymarczyk, Piotr Wójcik, Mateusz Pach, Bartosz Zieliński
Advancing Vision Transformers with Group-Mix Attention
Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo
Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches
Quazi Mishkatul Alam, Bilel Tarchoun, Ihsen Alouani, Nael Abu-Ghazaleh
Improving Source-Free Target Adaptation with Vision Transformers Leveraging Domain Representation Images
Gauransh Sawhney, Daksh Dave, Adeel Ahmed, Jiechao Gao, Khalid Saleem
Efficient Rotation Invariance in Deep Neural Networks through Artificial Mental Rotation
Lukas Tuggener, Thilo Stadelmann, Jürgen Schmidhuber
MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis
Yitao Zhu, Zhenrong Shen, Zihao Zhao, Sheng Wang, Xin Wang, Xiangyu Zhao, Dinggang Shen, Qian Wang