Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs often offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks.
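To make the two core ideas above concrete, here is a minimal, hedged sketch in PyTorch: a patch-embedding layer (the standard way an image becomes a token sequence) and a toy token-pruning step. The names `PatchEmbed` and `prune_tokens` are illustrative, not taken from any of the papers listed below, and the norm-based importance score is only a stand-in for the learned or attention-derived scores real pruning methods use.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one.

    A Conv2d with kernel_size == stride == patch_size is the usual trick:
    it extracts non-overlapping patches and projects them in a single op.
    """

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, embed_dim, H/ps, W/ps) -> (B, N, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)


def prune_tokens(tokens, keep_ratio=0.5):
    """Toy token pruning: keep the tokens with the largest L2 norm.

    Assumption: real methods score tokens with attention statistics or a
    learned predictor; the norm here merely illustrates the mechanics of
    selecting and gathering a subset of the sequence.
    """
    batch, n, dim = tokens.shape
    k = max(1, int(n * keep_ratio))
    scores = tokens.norm(dim=-1)                     # (B, N) importance proxy
    idx = scores.topk(k, dim=1).indices              # indices of kept tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, dim)
    return tokens.gather(1, idx)                     # (B, k, embed_dim)


if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)       # batch of 2 RGB images
    tokens = PatchEmbed()(x)              # (2, 196, 768) patch tokens
    pruned = prune_tokens(tokens, 0.5)    # (2, 98, 768) after pruning
    print(tokens.shape, pruned.shape)
```

Dropping half the tokens roughly halves the cost of every subsequent attention layer, which is why token pruning is a common efficiency lever for ViTs.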
Papers
LaVin-DiT: Large Vision Diffusion Transformer
Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu
Lung Disease Detection with Vision Transformers: A Comparative Study of Machine Learning Methods
Baljinnyam Dayan
DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery
Jaewoo Heo, George Hu, Zeyu Wang, Serena Yeung-Levy
Learning Parameter Sharing with Tensor Decompositions and Sparsity
Cem Üyük, Mike Lasby, Mohamed Yassin, Utku Evci, Yani Ioannou
On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen
Assessing the Performance of the DINOv2 Self-supervised Learning Vision Transformer Model for the Segmentation of the Left Atrium from MRI Images
Bipasha Kundu, Bidur Khanal, Richard Simon, Cristian A. Linte
SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers
Shravan Venkatraman, Jaskaran Singh Walia, Joe Dhanith P R
Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery
Ashim Dahal, Saydul Akbar Murad, Nick Rahimi
Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID
Mei Qiu, Lauren Ann Christopher, Stanley Chien, Lingxi Li
ViTOC: Vision Transformer and Object-aware Captioner
Feiyang Huang
Scalable, Tokenization-Free Diffusion Model Architectures with Efficient Initial Convolution and Fixed-Size Reusable Structures for On-Device Image Generation
Sanchar Palit, Sathya Veera Reddy Dendi, Mallikarjuna Talluri, Raj Narayana Gadde
GCI-ViTAL: Gradual Confidence Improvement with Vision Transformers for Active Learning on Label Noise
Moseli Mots'oehli, Kyungim Baek
Image inpainting enhancement by replacing the original mask with a self-attended region from the input image
Kourosh Kiani, Razieh Rastgoo, Alireza Chaji, Sergio Escalera
ViT Enhanced Privacy-Preserving Secure Medical Data Sharing and Classification
Al Amin, Kamrul Hasan, Sharif Ullah, M. Shamim Hossain
Cascaded Dual Vision Transformer for Accurate Facial Landmark Detection
Ziqiang Dang, Jianfang Li, Lin Liu