Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, offering better accuracy and efficiency than traditional convolutional networks on specific tasks.
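To make the "image as a sequence of patches" idea concrete, here is a minimal sketch of the patch-embedding stage that turns an image into transformer tokens. It assumes PyTorch is available; the image size, patch size, and embedding dimension are illustrative ViT-Base-style defaults and are not taken from any paper listed below.

# Minimal sketch of ViT-style patch embedding (assumes PyTorch).
# Sizes (224x224 image, 16x16 patches, 768-dim tokens) are illustrative defaults.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to a token."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        # x: (B, C, H, W) -> (B, num_patches, embed_dim)
        x = self.proj(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)   # prepend the [CLS] token
        return x + self.pos_embed        # add learned positional embeddings

if __name__ == "__main__":
    tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 197, 768]): 196 patch tokens + 1 [CLS]

The resulting token sequence is then fed to a standard transformer encoder; the efficiency techniques surveyed above (token pruning, quantization, hybrid state-space blocks) operate on or replace parts of this pipeline.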
Papers
Transformers are Universal In-context Learners
Takashi Furuya, Maarten V. de Hoop, Gabriel Peyré
Privacy-Preserving Split Learning with Vision Transformers using Patch-Wise Random and Noisy CutMix
Seungeun Oh, Sihun Baek, Jihong Park, Hyelin Nam, Praneeth Vepakomma, Ramesh Raskar, Mehdi Bennis, Seong-Lyun Kim
An Explainable Vision Transformer with Transfer Learning Combined with Support Vector Machine Based Efficient Drought Stress Identification
Aswini Kumar Patra, Ankit Varshney, Lingaraj Sahoo
SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving
Peiru Zheng, Yun Zhao, Zhan Gong, Hong Zhu, Shaohua Wu
MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity
Kanghyun Choi, Hye Yoon Lee, Dain Kwon, SunJong Park, Kyuyeun Kim, Noseong Park, Jinho Lee
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul
Twins-PainViT: Towards a Modality-Agnostic Vision Transformer Framework for Multimodal Automatic Pain Assessment using Facial Videos and fNIRS
Stefanos Gkikas, Manolis Tsiknakis
Rethinking RGB-D Fusion for Semantic Segmentation in Surgical Datasets
Muhammad Abdullah Jamal, Omid Mohareri
VSSD: Vision Mamba with Non-Causal State Space Duality
Yuheng Shi, Minjing Dong, Mingjia Li, Chang Xu
Skin Cancer Detection utilizing Deep Learning: Classification of Skin Lesion Images using a Vision Transformer
Carolin Flosdorf, Justin Engelker, Igor Keller, Nicolas Mohr
Mixed Non-linear Quantization for Vision Transformers
Gihwan Kim, Jemin Lee, Sihyeong Park, Yongin Kwon, Hyungshin Kim
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers
Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang
HG-PIPE: Vision Transformer Acceleration with Hybrid-Grained Pipeline
Qingyu Guo, Jiayong Wan, Songqiang Xu, Meng Li, Yuan Wang
How Lightweight Can A Vision Transformer Be
Jen Hong Tan