Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on certain tasks. The two core mechanics, patchification and token pruning, are sketched below.
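To make the summary concrete, the sketch below shows (1) how an image becomes a sequence of patch tokens via a strided-convolution patch embedding, and (2) a simple form of token pruning that keeps only the patches receiving the most attention from the [CLS] token. This is a minimal, self-contained illustration, not code from any paper listed here; the sizes, module names, and the [CLS]-attention pruning criterion are illustrative assumptions.

```python
# Minimal ViT-style patchification and token pruning (illustrative only).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel == stride == patch_size is the standard
        # trick for patchify + linear projection in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence


def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the patch tokens that receive the most [CLS] attention.

    tokens:   (B, N, D) patch tokens (excluding the [CLS] token)
    cls_attn: (B, N) attention weights from [CLS] to each patch token
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices        # (B, k) most-attended tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, D)    # (B, k, D)
    return tokens.gather(1, idx)                 # (B, k, D)


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 224, 224)
    tokens = PatchEmbed()(imgs)                      # (2, 196, 768)
    fake_cls_attn = torch.rand(2, tokens.shape[1])   # stand-in for real attention
    pruned = prune_tokens(tokens, fake_cls_attn)     # (2, 98, 768)
    print(tokens.shape, pruned.shape)
```

Because self-attention cost grows quadratically with sequence length, keeping roughly half the tokens cuts the cost of every subsequent attention layer substantially, which is the efficiency argument behind pruning-style methods.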
Papers
Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review
Sonia Bbouzidi, Ghazala Hcini, Imen Jdey, Fadoua Drira
SuperFormer: Volumetric Transformer Architectures for MRI Super-Resolution
Cristhian Forigua, Maria Escobar, Pablo Arbelaez
Learning Visual Prompts for Guiding the Attention of Vision Transformers
Razieh Rezaei, Masoud Jalili Sabet, Jindong Gu, Daniel Rueckert, Philip Torr, Ashkan Khakzar
Scaling White-Box Transformers for Vision
Jinrui Yang, Xianhang Li, Druv Pai, Yuyin Zhou, Yi Ma, Yaodong Yu, Cihang Xie
Sharing Key Semantics in Transformer Makes Efficient Image Restoration
Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu Sebe
P²-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer
Huihong Shi, Xin Cheng, Wendong Mao, Zhongfeng Wang
MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer
Ignat Polezhaev, Igor Goncharenko, Natalya Iurina
PanoNormal: Monocular Indoor 360° Surface Normal Estimation
Kun Huang, Fanglue Zhang, Neil Dodgson
Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain
Juntao Zhang, Kun Bian, Peng Cheng, Wenbo An, Jianning Liu, Jun Zhou
Wavelet-Based Image Tokenizer for Vision Transformers
Zhenhai Zhu, Radu Soricut
ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
Bencheng Liao, Xinggang Wang, Lianghui Zhu, Qian Zhang, Chang Huang
MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution
Wenzhuo Liu, Fei Zhu, Shijie Ma, Cheng-Lin Liu
Visualizing the loss landscape of Self-supervised Vision Transformer
Youngwan Lee, Jeffrey Ryan Willette, Jonghee Kim, Sung Ju Hwang
Efficient Time Series Processing for Transformers and State-Space Models through Token Merging
Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn
Near-Infrared and Low-Rank Adaptation of Vision Transformers in Remote Sensing
Irem Ulku, O. Ozgur Tanriover, Erdem Akagündüz