Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, initially designed for natural language processing, to image analysis by treating images as sequences of patches. Current research focuses on improving ViT efficiency and robustness through techniques like token pruning, attention engineering, and hybrid models combining ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advancements are driving progress in various applications, including medical image analysis, object detection, and spatiotemporal prediction, by offering improved accuracy and efficiency compared to traditional convolutional neural networks in specific tasks.
Papers
Efficient Vision Transformer for Accurate Traffic Sign Detection
Javad Mirzapour Kaleybar, Hooman Khaloo, Avaz Naghipour
Scattering Vision Transformer: Spectral Mixing Matters
Badri N. Patro, Vijay Srinivas Agneeswaran
Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition
Hamid Ahmadabadi, Omid Nejati Manzari, Ahmad Ayatollahi
AiluRus: A Scalable ViT Framework for Dense Prediction
Jin Li, Yaoming Wang, Xiaopeng Zhang, Bowen Shi, Dongsheng Jiang, Chenglin Li, Wenrui Dai, Hongkai Xiong, Qi Tian
Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders
Srijan Das, Tanmay Jain, Dominick Reilly, Pranav Balaji, Soumyajit Karmakar, Shyam Marjit, Xiang Li, Abhijit Das, Michael S. Ryoo
Addressing Limitations of State-Aware Imitation Learning for Autonomous Driving
Luca Cultrera, Federico Becattini, Lorenzo Seidenari, Pietro Pala, Alberto Del Bimbo
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein
ViR: Towards Efficient Vision Retention Backbones
Ali Hatamizadeh, Michael Ranzinger, Shiyi Lan, Jose M. Alvarez, Sanja Fidler, Jan Kautz
Exploiting Image-Related Inductive Biases in Single-Branch Visual Tracking
Chuanming Tang, Kai Wang, Joost van de Weijer, Jianlin Zhang, Yongmei Huang
Blacksmith: Fast Adversarial Training of Vision Transformers via a Mixture of Single-step and Multi-step Methods
Mahdi Salmani, Alireza Dehghanpour Farashah, Mohammad Azizmalayeri, Mahdi Amiri, Navid Eslami, Mohammad Taghi Manzuri, Mohammad Hossein Rohban
Analyzing Vision Transformers for Image Classification in Class Embedding Space
Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig
Concept-based Anomaly Detection in Retail Stores for Automatic Correction using Mobile Robots
Aditya Kapoor, Vartika Sengar, Nijil George, Vighnesh Vatsal, Jayavardhana Gubbi, Balamuralidhar P, Arpan Pal
Ophthalmic Biomarker Detection Using Ensembled Vision Transformers and Knowledge Distillation
H.A.Z. Sameen Shahgir, Khondker Salman Sayeed, Tanjeem Azwad Zaman, Md. Asif Haider, Sheikh Saifur Rahman Jony, M. Sohel Rahman