Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs offer improved accuracy and efficiency over convolutional baselines on specific tasks.
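The patch tokenization mentioned above can be sketched in a few lines: a minimal NumPy example (the function name `image_to_patches` is hypothetical, and this omits the learned linear projection and positional embeddings a real ViT adds afterward) that splits an image into non-overlapping patches and flattens each one into a token vector.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an H x W x C image into a sequence of flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    mirroring how a ViT tokenizes an image before the linear projection.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    ph, pw = h // patch_size, w // patch_size
    # Carve the image into a (ph, pw) grid of patch_size x patch_size blocks.
    patches = image.reshape(ph, patch_size, pw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two grid axes first
    # Flatten each block into one token vector.
    return patches.reshape(ph * pw, patch_size * patch_size * c)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768,
# the standard ViT-Base tokenization.
img = np.zeros((224, 224, 3))
tokens = image_to_patches(img, 16)
print(tokens.shape)  # (196, 768)
```

The reshape/transpose trick avoids explicit loops: the two spatial axes are each factored into (grid index, within-patch index), and the grid indices are moved to the front so a final reshape yields one row per patch.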
Papers
Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification
Delfina Sol Martinez Pandiani, Nicolas Lazzari, Valentina Presutti
Loss-Free Machine Unlearning
Jack Foster, Stefan Schoepf, Alexandra Brintrup
A Simple yet Effective Network based on Vision Transformer for Camouflaged Object and Salient Object Detection
Chao Hao, Zitong Yu, Xin Liu, Jun Xu, Huanjing Yue, Jingyu Yang
Vision Transformers with Natural Language Semantics
Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro
Massive Activations in Large Language Models
Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu
ViTaL: An Advanced Framework for Automated Plant Disease Identification in Leaf Images Using Vision Transformers and Linear Projection For Feature Reduction
Abhishek Sebastian, Annis Fathima A, Pragna R, Madhan Kumar S, Yaswanth Kannan G, Vinay Murali
Training Neural Networks from Scratch with Parallel Low-Rank Adapters
Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal
Investigating the Robustness of Vision Transformers against Label Noise in Medical Image Classification
Bidur Khanal, Prashant Shrestha, Sanskar Amgain, Bishesh Khanal, Binod Bhattarai, Cristian A. Linte
ConSept: Continual Semantic Segmentation via Adapter-based Vision Transformer
Bowen Dong, Guanglei Yang, Wangmeng Zuo, Lei Zhang
FViT: A Focal Vision Transformer with Gabor Filter
Yulong Shi, Mingwei Sun, Yongshuai Wang, Rui Wang, Hui Sun, Zengqiang Chen
ReViT: Enhancing Vision Transformers with Attention Residual Connections for Visual Recognition
Anxhelo Diko, Danilo Avola, Marco Cascio, Luigi Cinque
DiffPoint: Single and Multi-view Point Cloud Reconstruction with ViT Based Diffusion Model
Yu Feng, Xing Shi, Mengli Cheng, Yun Xiong