Vision Transformer
Vision Transformers (ViTs) adapt the transformer architecture, originally designed for natural language processing, to image analysis by treating an image as a sequence of patches. Current research focuses on improving ViT efficiency and robustness through techniques such as token pruning, post-training quantization, attention engineering, and hybrid models that combine ViTs with convolutional neural networks or other architectures (e.g., Mamba). These advances are driving progress in applications including medical image analysis, object detection, and spatiotemporal prediction, where ViTs can offer better accuracy and efficiency than traditional convolutional neural networks.
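To make the patch-based view concrete, below is a minimal sketch of the ViT patch-embedding step in PyTorch. The class and parameter names (PatchEmbed, img_size, patch_size, embed_dim) are illustrative assumptions, not taken from any of the papers listed here.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly project each
    patch to a token embedding, producing the sequence a transformer consumes."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, embed_dim)
        return x

# Usage: a 224x224 image becomes a sequence of 196 patch tokens of width 768,
# which is then fed to standard transformer encoder blocks.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```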
Papers
Exploring Camera Encoder Designs for Autonomous Driving Perception
Barath Lakshmanan, Joshua Chen, Shiyi Lan, Maying Shen, Zhiding Yu, Jose M. Alvarez
Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach
Taolin Zhang, Jiawang Bai, Zhihe Lu, Dongze Lian, Genping Wang, Xinchao Wang, Shu-Tao Xia
ERQ: Error Reduction for Post-Training Quantization of Vision Transformers
Yunshan Zhong, Jiawei Hu, You Huang, Yuxin Zhang, Rongrong Ji
Integrating Query-aware Segmentation and Cross-Attention for Robust VQA
Wonjun Choi, Sangbeom Lee, Seungyeon Lee, Heechul Jung, Dong-Gyu Lee
Enhancing Robotic Arm Activity Recognition with Vision Transformers and Wavelet-Transformed Channel State Information
Rojin Zandi, Kian Behzad, Elaheh Motamedi, Hojjat Salehinejad, Milad Siami
Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals
Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, Rudolf Lioutikov
Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning
Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe
Wavelet Convolutions for Large Receptive Fields
Shahaf E. Finder, Roy Amoyal, Eran Treister, Oren Freifeld
Learning Motion Blur Robust Vision Transformers with Dynamic Early Exit for Real-Time UAV Tracking
You Wu, Xucheng Wang, Dan Zeng, Hengzhou Ye, Xiaolan Xie, Qijun Zhao, Shuiwang Li
CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs
Akshat Ramachandran, Souvik Kundu, Tushar Krishna
Towards Attention-based Contrastive Learning for Audio Spoof Detection
Chirag Goel, Surya Koppisetti, Ben Colman, Ali Shahriyari, Gaurav Bharaj
PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition
Yanbin Hao, Diansong Zhou, Zhicai Wang, Chong-Wah Ngo, Meng Wang
ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers
Yanfeng Jiang, Ning Sun, Xueshuo Xie, Fei Yang, Tao Li
Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge
Nick John Eliopoulos, Purvish Jajal, James C. Davis, Gaowen Liu, George K. Thiruvathukal, Yung-Hsiang Lu
Transferable-guided Attention Is All You Need for Video Domain Adaptation
André Sacilotti, Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida