Unimodal Model

Unimodal models, which focus on a single data modality (e.g., text or images), are increasingly leveraged to build and improve multimodal models that integrate information from multiple sources. Current research emphasizes efficient methods for aligning unimodal representations, often using contrastive learning, projection layers, or Mixture-of-Experts (MoE) architectures, to create effective multimodal systems. This line of work is significant because it lets researchers build powerful multimodal models from existing, well-trained unimodal architectures, reducing computational cost and data requirements while improving performance on tasks such as sentiment analysis, activity recognition, and image retrieval.
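As a rough illustration of the alignment idea described above, the sketch below projects the outputs of two hypothetical frozen unimodal encoders (here stood in for by random embeddings) into a shared space with linear projection layers and scores them with a symmetric contrastive (InfoNCE-style) loss, as popularized by CLIP-like systems. All names, dimensions, and the temperature value are illustrative assumptions, not taken from any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen unimodal encoder outputs over a batch of 4 paired
# samples: a text encoder producing 8-d vectors, an image encoder 6-d.
text_emb = rng.normal(size=(4, 8))
image_emb = rng.normal(size=(4, 6))

# Learnable projection layers mapping both modalities into a shared 5-d space.
W_text = rng.normal(size=(8, 5)) * 0.1
W_image = rng.normal(size=(6, 5)) * 0.1

def project_and_normalize(emb, W):
    """Project embeddings into the shared space and L2-normalize rows."""
    z = emb @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy over rows, computed via a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss: the i-th pair in the batch is the positive;
    all other pairings in the batch serve as negatives."""
    logits = (z_a @ z_b.T) / temperature   # (batch, batch) cosine similarities
    labels = np.arange(len(z_a))
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

z_text = project_and_normalize(text_emb, W_text)
z_image = project_and_normalize(image_emb, W_image)
loss = contrastive_loss(z_text, z_image)
```

In practice only the projection matrices (and possibly a learnable temperature) would be trained by gradient descent on this loss, while the pretrained unimodal encoders stay frozen; that is what keeps the compute and data cost low relative to training a multimodal model from scratch.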

Papers