Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing, to image analysis by treating an image as a sequence of patch tokens. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications such as medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks.
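To make the patch-as-token idea concrete, below is a minimal sketch of ViT-style patch embedding, assuming PyTorch; the PatchEmbed class and all dimensions (224x224 images, 16x16 patches, 768-dim embeddings) are illustrative defaults, not taken from any of the papers listed below.

# Minimal sketch of ViT-style patch tokenization (illustrative only;
# names and dimensions are assumptions, not from a specific paper).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution extracts and linearly embeds each patch in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim) token sequence

# Usage: a 224x224 RGB image becomes a sequence of 196 patch tokens.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])

In a full ViT, positional embeddings (and typically a class token) are added to this sequence before it enters standard transformer encoder blocks; several of the papers below target exactly this tokenization step, for example through mixed-scale or adaptive-length tokens.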
Papers
INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
Lakshmi Nair, Mikhail Bernadskiy, Arulselvan Madhavan, Craig Chan, Ayon Basumallik, Darius Bunandar
Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
Dahyun Kang, Piotr Koniusz, Minsu Cho, Naila Murray
MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi
Interactive Image Segmentation with Cross-Modality Vision Transformers
Kun Li, George Vosselman, Michael Ying Yang
MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets
Siyi Du, Nourhan Bayasi, Ghassan Hamarneh, Rafeef Garbi
Make A Long Image Short: Adaptive Token Length for Vision Transformers
Qiqi Zhou, Yichen Zhu
WavePaint: Resource-efficient Token-mixer for Self-supervised Inpainting
Pranav Jeevan, Dharshan Sampath Kumar, Amit Sethi
Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision
Xijie Huang, Zhiqiang Shen, Pingcheng Dong, Kwang-Ting Cheng
More for Less: Compact Convolutional Transformers Enable Robust Medical Image Classification with Limited Data
Andrew Kean Gao