Unimodal Model
Unimodal models, which focus on a single data modality (e.g., text or images), are increasingly being leveraged to build and improve multimodal models that integrate information from multiple sources. Current research emphasizes efficient alignment methods, such as contrastive learning, projection layers, and Mixture-of-Experts (MoE) architectures, for combining unimodal representations into effective multimodal systems. This work is significant because reusing the strengths of existing, well-trained unimodal architectures reduces computational costs and data requirements while improving performance on tasks such as sentiment analysis, activity recognition, and image retrieval.
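To make the alignment idea concrete, the following is a minimal sketch of one common recipe: small projection layers trained with a CLIP-style symmetric contrastive loss on top of embeddings from frozen unimodal encoders. The class name ProjectionAligner and the embedding dimensions are illustrative assumptions, not an implementation from any of the listed papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionAligner(nn.Module):
    """Aligns frozen unimodal embeddings in a shared space via projection layers.

    A minimal CLIP-style sketch (hypothetical names/dimensions): each modality
    gets a linear projection into a common embedding space, trained with a
    symmetric contrastive (InfoNCE) loss.
    """
    def __init__(self, text_dim: int, image_dim: int, shared_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        # Learnable log-temperature, as in CLIP.
        self.log_temp = nn.Parameter(torch.tensor(2.0))

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Project each modality into the shared space and L2-normalize.
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        v = F.normalize(self.image_proj(image_emb), dim=-1)
        # Pairwise cosine similarities, scaled by the learned temperature.
        logits = t @ v.T * self.log_temp.exp()
        # Matched text/image pairs lie on the diagonal; apply the loss both ways.
        targets = torch.arange(len(t), device=t.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

# Usage with random stand-ins for frozen-encoder outputs (dims are illustrative).
aligner = ProjectionAligner(text_dim=768, image_dim=1024)
loss = aligner(torch.randn(32, 768), torch.randn(32, 1024))
loss.backward()
```

In practice the projections may be MLPs rather than single linear layers, and the frozen unimodal encoders would supply the input embeddings; only the small projection heads and the temperature are trained, which is what keeps this approach cheap in compute and data relative to training a multimodal model from scratch.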
Papers
UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks
Yanan Sun, Zihan Zhong, Qi Fan, Chi-Keung Tang, Yu-Wing Tai
Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications
Paul Pu Liang, Chun Kai Ling, Yun Cheng, Alex Obolenskiy, Yudong Liu, Rohan Pandey, Alex Wilf, Louis-Philippe Morency, Ruslan Salakhutdinov