Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, often offering better accuracy and efficiency than purely convolutional networks on specific tasks.
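As a concrete illustration of the patch-sequence idea described above, here is a minimal sketch of a tiny ViT in PyTorch. It is not taken from any of the listed papers; the class name, layer sizes, and hyperparameters are illustrative assumptions. A strided convolution turns the image into patch embeddings, a learnable class token and position embeddings are added, and the resulting sequence is processed by a standard transformer encoder.

```python
# Minimal ViT sketch (illustrative only): image -> patch sequence -> transformer encoder.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patchify and linearly embed in one step: kernel and stride equal the patch size.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                      # images: (B, 3, H, W)
        x = self.patch_embed(images)                # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)            # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                         # self-attention over the patch sequence
        return self.head(x[:, 0])                   # classify from the [CLS] token

if __name__ == "__main__":
    logits = TinyViT()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 10])
```

Token-pruning and token-merging methods of the kind surveyed below typically operate on the patch sequence produced at this stage, dropping or fusing uninformative tokens to reduce the quadratic cost of self-attention.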
Papers
Learning with SASQuaTCh: a Novel Variational Quantum Transformer Architecture with Kernel-Based Self-Attention
Ethan N. Evans, Matthew Cook, Zachary P. Bradshaw, Margarite L. LaBorde
Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
Junyi Wu, Bin Duan, Weitai Kang, Hao Tang, Yan Yan
SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks
Xinyu Shi, Zecheng Hao, Zhaofei Yu
Accelerating ViT Inference on FPGA through Static and Dynamic Pruning
Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl Busart, Viktor Prasanna
Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers
Yuyang Shu, Michael E. Bain
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim
Rotary Position Embedding for Vision Transformer
Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun
ADAPT to Robustify Prompt Tuning Vision Transformers
Masih Eskandar, Tooba Imtiaz, Zifeng Wang, Jennifer Dy
Improved EATFormer: A Vision Transformer for Medical Image Classification
Yulong Shisu, Susano Mingwin, Yongshuai Wanwag, Zengqiang Chenso, Sunshin Huing
ViTGaze: Gaze Following with Interaction Features in Vision Transformers
Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, Xiangmin Xu
Emotion Recognition Using Transformers with Masked Learning
Seongjae Min, Junseok Yang, Sangjun Lim, Junyong Lee, Sangwon Lee, Sejoon Lim
Machine Learning and Vision Transformers for Thyroid Carcinoma Diagnosis: A review
Yassine Habchi, Hamza Kheddar, Yassine Himeur, Abdelkrim Boukabou, Ammar Chouchane, Abdelmalik Ouamane, Shadi Atalla, Wathiq Mansoor
From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series Forecasting
Zhen Zeng, Rachneet Kaur, Suchetha Siddagangappa, Tucker Balch, Manuela Veloso
On the low-shot transferability of [V]-Mamba
Diganta Misra, Jay Gala, Antonio Orvieto
Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
Haoyang Liu, Aditya Singh, Yijiang Li, Haohan Wang
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification
Pingping Zhang, Yuhao Wang, Yang Liu, Zhengzheng Tu, Huchuan Lu
When Training-Free NAS Meets Vision Transformer: A Neural Tangent Kernel Perspective
Qiqi Zhou, Yichen Zhu