Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks.
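The core idea of treating an image as a sequence of patches can be sketched as follows: split the image into non-overlapping patches, flatten each one, and linearly project it to a token embedding. This is a minimal NumPy illustration, not any particular paper's implementation; the function name and the random projection (a learned weight matrix in a real ViT) are assumptions for demonstration.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, embed_dim, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and linearly
    project each flattened patch to an embedding vector, yielding the
    token sequence a ViT consumes. Hypothetical helper for illustration."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # Rearrange into (num_patches, P*P*C): first carve out a patch grid,
    # then flatten each patch.
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    # In a real model this projection is learned; random weights stand in here.
    W_proj = rng.standard_normal((P * P * C, embed_dim)) / np.sqrt(P * P * C)
    return patches @ W_proj  # shape: (num_patches, embed_dim)

# A 224x224 RGB image with 16x16 patches becomes a sequence of 196 tokens,
# the standard ViT-Base configuration.
img = np.zeros((224, 224, 3))
tokens = image_to_patch_embeddings(img, patch_size=16, embed_dim=768)
print(tokens.shape)  # (196, 768)
```

Many of the efficiency techniques surveyed above (token merging, pruning, patch summarization) operate on exactly this token sequence, shortening it before or between attention layers.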
Papers
On the unreasonable vulnerability of transformers for image restoration -- and an easy fix
Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Julia Grabinski, Paramanand Chandramouli, Margret Keuper
Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition
Cheng Da, Peng Wang, Cong Yao
Learned Thresholds Token Merging and Pruning for Vision Transformers
Maxim Bonnaerens, Joni Dambre
Reverse Knowledge Distillation: Training a Large Model using a Small One for Retinal Image Matching on Limited Data
Sahar Almahfouz Nasser, Nihar Gupte, Amit Sethi
Quantized Feature Distillation for Network Quantization
Ke Zhu, Yin-Yin He, Jianxin Wu
Light-Weight Vision Transformer with Parallel Local and Global Self-Attention
Nikolas Ebert, Laurenz Reichardt, Didier Stricker, Oliver Wasenmüller
R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut
Yingjie Niu, Ming Ding, Maoning Ge, Robin Karlsson, Yuxiao Zhang, Kazuya Takeda
Human Action Recognition in Still Images Using ConViT
Seyed Rohollah Hosseyni, Sanaz Seyedin, Hasan Taheri
Scale-Aware Modulation Meet Transformer
Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, Lianwen Jin
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang
Study of Vision Transformers for Covid-19 Detection from Chest X-rays
Sandeep Angara, Sharath Thirunagaru
Cumulative Spatial Knowledge Distillation for Vision Transformers
Borui Zhao, Renjie Song, Jiajun Liang
TALL: Thumbnail Layout for Deepfake Video Detection
Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, Ran He
HEAL-SWIN: A Vision Transformer On The Sphere
Oscar Carlsson, Jan E. Gerken, Hampus Linander, Heiner Spieß, Fredrik Ohlsson, Christoffer Petersson, Daniel Persson
MaxSR: Image Super-Resolution Using Improved MaxViT
Bincheng Yang, Gangshan Wu