Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, initially designed for natural language processing, to image analysis by treating images as sequences of patches. Current research focuses on improving ViT efficiency and robustness through techniques like token pruning, attention engineering, and hybrid models combining ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advancements are driving progress in various applications, including medical image analysis, object detection, and spatiotemporal prediction, by offering improved accuracy and efficiency compared to traditional convolutional neural networks in specific tasks.
Papers
Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects
Wenhao Li, Yudong Xu, Scott Sanner, Elias Boutros Khalil
Vision Transformer based Random Walk for Group Re-Identification
Guoqing Zhang, Tianqi Liu, Wenxuan Fang, Yuhui Zheng
LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions
Ravindran Kannan, Chiranjib Bhattacharyya, Praneeth Kacham, David P. Woodruff
Brain Mapping with Dense Features: Grounding Cortical Semantic Selectivity in Natural Images With Vision Transformers
Andrew F. Luo, Jacob Yeung, Rushikesh Zawar, Shaurya Dewan, Margaret M. Henderson, Leila Wehbe, Michael J. Tarr
TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
Ya-Qi Yu, Minghui Liao, Jiwen Zhang, Jihao Wu
Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering
Kazumoto Nakamura, Yuji Nozawa, Yu-Chieh Lin, Kengo Nakata, Youyang Ng
PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners
Yujin Tang, Lu Qi, Fei Xie, Xiangtai Li, Chao Ma, Ming-Hsuan Yang
GABIC: Graph-based Attention Block for Image Compression
Gabriele Spadaro, Alberto Presta, Enzo Tartaglione, Jhony H. Giraldo, Marco Grangetto, Attilio Fiandrotti
MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation
Gurucharan Marthi Krishna Kumar, Aman Chadha, Janine Mendola, Amir Shmuel
HiFiSeg: High-Frequency Information Enhanced Polyp Segmentation with Global-Local Vision Transformer
Jingjing Ren, Xiaoyong Zhang, Lina Zhang