Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks (CNNs) or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional CNNs on certain tasks.
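To make the "images as sequences of patches" idea concrete, here is a minimal sketch of the standard ViT patch-embedding step in PyTorch; the class name, shapes, and default sizes (224x224 input, 16x16 patches, 768-dimensional embeddings, as in the original ViT-Base configuration) are illustrative, not taken from any of the papers below.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Splits an image into non-overlapping patches and linearly embeds each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A convolution with kernel_size == stride == patch_size applies one
        # linear projection per patch, which is equivalent to flattening each
        # patch and multiplying by a shared weight matrix.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting token sequence (plus positional embeddings and, in many variants, a class token) is what the transformer encoder consumes; techniques like token pruning operate on exactly this sequence by dropping uninformative tokens.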
Papers
Transformer-based Image and Video Inpainting: Current Challenges and Future Directions
Omar Elharrouss, Rafat Damseh, Abdelkader Nasreddine Belkacem, Elarbi Badidi, Abderrahmane Lakas
Vision Transformer with Key-select Routing Attention for Single Image Dehazing
Lihan Tong, Weijia Li, Qingxia Yang, Liyuan Chen, Peng Chen
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
Ali Khaleghi Rahimian, Manish Kumar Govind, Subhajit Maity, Dominick Reilly, Christian Kümmerle, Srijan Das, Aritra Dutta
ViT LoS V2X: Vision Transformers for Environment-aware LoS Blockage Prediction for 6G Vehicular Networks
Ghazi Gharsallah, Georges Kaddoum
Towards Optimal Trade-offs in Knowledge Distillation for CNNs and Vision Transformers at the Edge
John Violos, Symeon Papadopoulos, Ioannis Kompatsiaris
Brain Tumor Classification using Vision Transformer with Selective Cross-Attention Mechanism and Feature Calibration
Mohammad Ali Labbaf Khaniki, Marzieh Mirzaeibonehkhater, Mohammad Manthouri, Elham Hasani
Investigating Self-Supervised Methods for Label-Efficient Learning
Srinivasa Rao Nandam, Sara Atito, Zhenhua Feng, Josef Kittler, Muhammad Awais
Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks
Zihao Jin, Yingying Fang, Jiahao Huang, Caiwen Xu, Simon Walsh, Guang Yang
Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net) for 3D Medical Image Segmentation and Visualization
Siyavash Shabani, Muhammad Sohaib, Sahar A. Mohammed, Bahram Parvin
Multi-Modal Vision Transformers for Crop Mapping from Satellite Image Time Series
Theresa Follath, David Mickisch, Jan Hemmerling, Stefan Erasmi, Marcel Schwieder, Begüm Demir
Are Vision xLSTM Embedded UNet More Reliable in Medical 3D Image Segmentation?
Pallabi Dutta, Soham Bose, Swalpa Kumar Roy, Sushmita Mitra
SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition
Liutao Yu, Liwei Huang, Chenlin Zhou, Han Zhang, Zhengyu Ma, Huihui Zhou, Yonghong Tian
SiT: Symmetry-Invariant Transformers for Generalisation in Reinforcement Learning
Matthias Weissenbacher, Rishabh Agarwal, Yoshinobu Kawahara
Demonstrating the Efficacy of Kolmogorov-Arnold Networks in Vision Tasks
Minjong Cheon
PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers
Mohammad Erfan Sadeghi, Arash Fayyazi, Seyedarmin Azizi, Massoud Pedram