Pre-Trained Unimodal Models
Pre-trained unimodal models, each trained independently on a single modality such as text or images, are increasingly used as building blocks for multimodal applications. Current research focuses on fusing these pre-trained models efficiently, often via cross-attention mechanisms or interactive prompting, to combine unimodal representations while minimizing computational cost and preserving each model's individual strengths. This approach is especially advantageous when aligned multimodal data is scarce, and it improves zero-shot and few-shot performance on tasks such as vision-language understanding and multimodal sentiment analysis, ultimately advancing the development of more robust and efficient multimodal systems.
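As a concrete illustration of the cross-attention fusion mentioned above, the sketch below lets text-token features (queries) attend to image-patch features (keys and values) from two frozen unimodal encoders. All dimensions, weight matrices, and the randomly generated toy features are illustrative assumptions, not taken from any specific model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats, Wq, Wk, Wv):
    """Fuse two unimodal feature sets with single-head cross-attention.

    Text features form the queries; image features supply keys and values,
    so each text token gathers a weighted summary of the image patches.
    """
    Q = text_feats @ Wq                      # (n_text, d)
    K = image_feats @ Wk                     # (n_img, d)
    V = image_feats @ Wv                     # (n_img, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # (n_text, d) fused features

# Toy stand-ins for frozen unimodal encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
d_text, d_img, d = 8, 6, 4
text_feats = rng.standard_normal((3, d_text))   # 3 text tokens
image_feats = rng.standard_normal((5, d_img))   # 5 image patches
Wq = rng.standard_normal((d_text, d))           # the only newly trained
Wk = rng.standard_normal((d_img, d))            # parameters are these
Wv = rng.standard_normal((d_img, d))            # small projections

fused = cross_attention(text_feats, image_feats, Wq, Wk, Wv)
print(fused.shape)  # (3, 4): one image-conditioned vector per text token
```

Because only the small projection matrices are trained while both encoders stay frozen, this kind of fusion keeps the computational and data cost low, which is precisely the appeal in limited-aligned-data settings.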