Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating images as sequences of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications such as medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks on specific tasks.
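To make the patch-as-token idea concrete, below is a minimal PyTorch sketch of the core ViT pipeline: split an image into fixed-size patches, linearly embed each patch, prepend a class token, add position embeddings, and run the sequence through a standard transformer encoder. The class name (TinyViT) and all hyperparameters are illustrative assumptions, not taken from any of the papers listed here.

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution: one token per patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])              # classify from the class token

model = TinyViT()
logits = model(torch.randn(2, 3, 224, 224))         # shape (2, 1000)

The token-reduction and pruning work listed below operates on the (B, N, dim) sequence produced after patch embedding, by dropping or merging tokens before or between encoder layers.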
Papers
Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting
Qidong Huang, Xiaoyi Dong, Dongdong Chen, Yinpeng Chen, Lu Yuan, Gang Hua, Weiming Zhang, Nenghai Yu
TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective
Jun Dan, Yang Liu, Haoyu Xie, Jiankang Deng, Haoran Xie, Xuansong Xie, Baigui Sun
On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers
Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, Elisa Ricci
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers
Tobias Christian Nauen, Sebastian Palacio, Andreas Dengel
A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis
Esteve Valls Mascaro, Hyemin Ahn, Dongheui Lee
A Robust Image Forensic Framework Utilizing Multi-Colorspace Enriched Vision Transformer for Distinguishing Natural and Computer-Generated Images
Manjary P. Gangan, Anoop Kadan, Lajish V L
Large-kernel Attention for Efficient and Robust Brain Lesion Segmentation
Liam Chalcroft, Ruben Lourenço Pereira, Mikael Brudfors, Andrew S. Kayser, Mark D'Esposito, Cathy J. Price, Ioannis Pappas, John Ashburner
Spatial Gated Multi-Layer Perceptron for Land Use and Land Cover Mapping
Ali Jamali, Swalpa Kumar Roy, Danfeng Hong, Peter M Atkinson, Pedram Ghamisi
A degree of image identification at sub-human scales could be possible with more advanced clusters
Prateek Y J
Which Tokens to Use? Investigating Token Reduction in Vision Transformers
Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund