Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, refined attention mechanisms (e.g., linear attention), and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks.
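Since the core idea above is treating an image as a sequence of patches, the following minimal PyTorch sketch illustrates that patchification step. It is an illustrative module, not code from any of the listed papers; the patch size (16) and embedding dimension (768) are assumptions matching the common ViT-Base configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch embeddings, as in a standard ViT."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A convolution with kernel == stride == patch_size applies one linear
        # projection per non-overlapping patch, equivalent to flatten + Linear.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, 3, 224, 224)
        x = self.proj(x)                  # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (batch, 196, 768): the patch sequence
        return x

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

From here, a transformer encoder consumes the 196-token sequence exactly as it would a sentence of word embeddings, typically after prepending a class token and adding positional embeddings.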
Papers
DINO-CXR: A self supervised method based on vision transformer for chest X-ray classification
Mohammadreza Shakouri, Fatemeh Iranmanesh, Mahdi Eftekhari
ViT2EEG: Leveraging Hybrid Pretrained Vision Transformers for EEG Data
Ruiqi Yang, Eric Modesitt
FLatten Transformer: Vision Transformer using Focused Linear Attention
Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang
Improving Pixel-based MIM by Reducing Wasted Modeling Capability
Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin
LGViT: Dynamic Early Exiting for Accelerating Vision Transformer
Guanyu Xu, Jiawei Hao, Li Shen, Han Hu, Yong Luo, Hui Lin, Jialie Shen
IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer
Xiaochen Ma, Bo Du, Zhuohang Jiang, Xia Du, Ahmed Y. Al Hammadi, Jizhe Zhou
Pre-training Vision Transformers with Very Limited Synthesized Images
Ryo Nakamura, Hirokatsu Kataoka, Sora Takashima, Edgar Josafat Martinez Noriega, Rio Yokota, Nakamasa Inoue
MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation
Reiner Birkl, Diana Wofk, Matthias Müller
Sparse Double Descent in Vision Transformers: real or phantom threat?
Victor Quétu, Marta Milovanovic, Enzo Tartaglione
Enhanced Security against Adversarial Examples Using a Random Ensemble of Encrypted Vision Transformer Models
Ryota Iijima, Miki Tanaka, Sayaka Shiota, Hitoshi Kiya
AViT: Adapting Vision Transformers for Small Skin Lesion Segmentation Datasets
Siyi Du, Nourhan Bayasi, Ghassan Hamarneh, Rafeef Garbi