Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, offering better accuracy and efficiency than traditional convolutional neural networks on certain tasks.
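The patch-as-token idea can be illustrated with a minimal, framework-free sketch. The function and constant names (`patchify`, `PATCH`) are illustrative, not from any specific library; a real ViT would additionally project each flattened patch with a learned linear layer and add position embeddings before feeding the sequence to the transformer encoder.

```python
# Minimal sketch: turning an image into a sequence of patch "tokens",
# as a Vision Transformer does before its embedding layer.
# `patchify` and `PATCH` are illustrative names, not a library API.

PATCH = 4  # patch side length (ViT commonly uses 16 for 224x224 inputs)

def patchify(image, patch=PATCH):
    """Split an H x W image (list of rows) into flattened patch vectors.

    Each non-overlapping patch x patch block becomes one token of
    length patch * patch, read out in row-major order.
    """
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[top + i][left + j]
                    for i in range(patch)
                    for j in range(patch)]
            tokens.append(flat)
    return tokens

# An 8x8 single-channel "image" yields (8/4) * (8/4) = 4 tokens,
# each of length 4 * 4 = 16.
img = [[r * 8 + c for c in range(8)] for r in range(8)]
seq = patchify(img)
print(len(seq), len(seq[0]))  # 4 16
```

For a standard 224x224 input with 16x16 patches this produces a sequence of 196 tokens, which is what makes self-attention over an entire image computationally tractable.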
Papers
PoseViNet: Distracted Driver Action Recognition Framework Using Multi-View Pose Estimation and Vision Transformer
Neha Sengar, Indra Kumari, Jihui Lee, Dongsoo Har
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Ehsan Abbasnejad, Hamed Damirchi, Ignacio M. Jara, Felipe Bravo-Marquez, Anton van den Hengel
Hierarchical Vision Transformers for Context-Aware Prostate Cancer Grading in Whole Slide Images
Clément Grisi, Geert Litjens, Jeroen van der Laak
Integrating Human Vision Perception in Vision Transformers for Classifying Waste Items
Akshat Kishore Shrivastava, Tapan Kumar Gandhi
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, Zechao Li
Weight subcloning: direct initialization of transformers using larger pretrained ones
Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari
Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery
Zimian Wei, Lujun Li, Peijie Dong, Zheng Hui, Anggeng Li, Menglong Lu, Hengyue Pan, Zhiliang Tian, Dongsheng Li
Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost
Haolin Qin, Daquan Zhou, Tingfa Xu, Ziyang Bian, Jianan Li
Kraken: enabling joint trajectory prediction by utilizing Mode Transformer and Greedy Mode Processing
Daniil S. Antonenko, Stepan Konev, Yuriy Biktairov, Boris Yangel
MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness
Xiaoyun Xu, Shujian Yu, Zhuoran Liu, Stjepan Picek
Adapting Vision Transformer for Efficient Change Detection
Yang Zhao, Yuxiang Zhang, Yanni Dong, Bo Du
A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement
Risab Biswas, Swalpa Kumar Roy, Umapada Pal
When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology
Wenhui Wang, Shuming Ma, Hanwen Xu, Naoto Usuyama, Jiayu Ding, Hoifung Poon, Furu Wei