ViT-Lens
ViT-Lens research focuses on extending the strengths of Vision Transformers (ViTs) to data modalities beyond standard images, with the goal of building more versatile and general-purpose models. Current work centers on efficient methods for projecting diverse inputs (e.g., EEG signals, 3D point clouds, audio) into a shared representation space that a pre-trained ViT can process, often incorporating novel attention mechanisms or hybrid CNN-ViT architectures. This approach promises to improve the efficiency and generalizability of AI systems across a wider range of applications, particularly in areas such as medical imaging, video analysis, and robotics, where multimodal data is prevalent.
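As a rough illustration of this pattern (a minimal sketch, not the published ViT-Lens implementation), the code below shows a small modality-specific encoder that maps a 3D point cloud into a fixed number of tokens with the same embedding width as a frozen, pre-trained ViT, which then processes them like ordinary image patch tokens. The class name PointLens, the cross-attention pooling scheme, and the stand-in transformer backbone are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PointLens(nn.Module):
    """Project a point cloud of shape (B, N, 3) into ViT-compatible tokens."""

    def __init__(self, num_tokens: int = 196, embed_dim: int = 768):
        super().__init__()
        # Per-point MLP lifts raw coordinates to the ViT embedding width.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 128), nn.GELU(), nn.Linear(128, embed_dim)
        )
        # Learned queries pool a variable number of points into a fixed token count.
        self.token_queries = nn.Parameter(torch.randn(num_tokens, embed_dim))
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        feats = self.point_mlp(points)                              # (B, N, D)
        queries = self.token_queries.expand(points.size(0), -1, -1)  # (B, T, D)
        tokens, _ = self.cross_attn(queries, feats, feats)           # (B, T, D)
        return tokens


# Stand-in for a frozen, pre-trained ViT backbone; in practice this would be
# a real checkpoint (e.g. loaded from a model zoo) with its weights frozen.
vit_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
for p in vit_backbone.parameters():
    p.requires_grad = False  # only the lens is trained

lens = PointLens()
point_cloud = torch.randn(2, 1024, 3)       # batch of 2 clouds, 1024 points each
features = vit_backbone(lens(point_cloud))  # (2, 196, 768) shared-space features
print(features.shape)
```

Under this setup, only the lens (and possibly a lightweight task head) would typically be trained, while the pre-trained ViT stays frozen, which is one way such approaches aim to remain parameter-efficient across modalities.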
Papers
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer
ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer
Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao