Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating images as sequences of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViT-based models offer better accuracy and efficiency than traditional convolutional neural networks on certain tasks.
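To make the patch-sequence idea and the token-pruning efficiency lever concrete, here is a minimal PyTorch sketch of a ViT forward pass. It is illustrative only and not taken from any of the papers below: the class name MiniViT, the hyperparameters, and the keep_ratio argument are all invented for this example, and the pruning step is a crude stand-in (real methods score tokens, e.g. by their attention to the [CLS] token, before discarding them).

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT sketch: split the image into patches, embed them,
    prepend a [CLS] token, and run a standard transformer encoder."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution (equivalent to
        # flattening non-overlapping patches and applying a linear layer).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, keep_ratio=1.0):
        # (B, 3, H, W) -> (B, dim, H/P, W/P) -> (B, N, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        # Crude stand-in for token pruning: keep the [CLS] token plus the
        # first keep_ratio fraction of patch tokens. Published pruning
        # methods rank tokens by importance instead of truncating.
        if keep_ratio < 1.0:
            n_keep = int((tokens.shape[1] - 1) * keep_ratio)
            tokens = torch.cat([tokens[:, :1], tokens[:, 1:n_keep + 1]], dim=1)
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # classify from the [CLS] token

model = MiniViT()
logits = model(torch.randn(2, 3, 224, 224), keep_ratio=0.5)
print(logits.shape)  # torch.Size([2, 1000])
```

Because self-attention cost grows quadratically with the number of tokens, dropping half of the patch tokens as above roughly quarters the attention FLOPs in the encoder, which is the motivation behind the token-reduction papers listed below.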
Papers
Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
Dota Tianai Dong, Mariya Toneva
SynthEnsemble: A Fusion of CNN, Vision Transformer, and Hybrid Models for Multi-Label Chest X-Ray Classification
S. M. Nabil Ashraf, Md. Adyelullahil Mamun, Hasnat Md. Abdullah, Md. Golam Rabiul Alam
LT-ViT: A Vision Transformer for multi-label Chest X-ray classification
Umar Marikkar, Sara Atito, Muhammad Awais, Adam Mahdi
Cross-Axis Transformer with 3D Rotary Positional Embeddings
Lily Erickson
FMViT: A multiple-frequency mixing Vision Transformer
Wei Tan, Yifeng Geng, Xuansong Xie
Glioblastoma Tumor Segmentation using an Ensemble of Vision Transformers
Huafeng Liu, Benjamin Dowdell, Todd Engelder, Zarah Pulmano, Nicolas Osa, Arko Barman
Vision Encoder-Decoder Models for AI Coaching
Jyothi S Nayak, Afifah Khan Mohammed Ajmal Khan, Chirag Manjeshwar, Imadh Ajaz Banday
FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer
Chi-Chih Chang, Yuan-Yao Sung, Shixing Yu, Ning-Chi Huang, Diana Marculescu, Kai-Chiang Wu
Mini but Mighty: Finetuning ViTs with Mini Adapters
Imad Eddine Marouf, Enzo Tartaglione, Stéphane Lathuilière
Lightweight Portrait Matting via Regional Attention and Refinement
Yatao Zhong, Ilya Zharkov
Asymmetric Masked Distillation for Pre-Training Small Foundation Models
Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang
SugarViT -- Multi-objective Regression of UAV Images with Vision Transformers and Deep Label Distribution Learning Demonstrated on Disease Severity Prediction in Sugar Beet
Maurice Günder, Facundo Ramón Ispizua Yamati, Abel Andree Barreto Alcántara, Anne-Katrin Mahlein, Rafet Sifa, Christian Bauckhage
GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation
Xuwei Xu, Sen Wang, Yudong Chen, Yanping Zheng, Zhewei Wei, Jiajun Liu
Fast and Interpretable Face Identification for Out-Of-Distribution Data Using Vision Transformers
Hai Phan, Cindy Le, Vu Le, Yihui He, Anh Totti Nguyen