Pre-Trained Unimodal Models

Pre-trained unimodal models, already powerful in their respective domains (e.g., image, text, audio), are increasingly leveraged to build more capable multimodal systems. Current research focuses on efficient methods for integrating these pre-trained models, often employing architectures such as Mixture of Experts (MoE) or novel fusion strategies to overcome challenges like modality-specific biases and computational limitations. This approach enables robust multimodal systems with reduced training data requirements and improved performance on downstream tasks, with impact across natural language processing, computer vision, and audio analysis.
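
To make the fusion idea concrete, the sketch below shows one common pattern under simplified assumptions: two frozen pre-trained unimodal encoders whose outputs are combined by a small, trainable gated fusion head (an input-dependent two-expert mixture over the image and text representations). The class name `GatedMultimodalFusion`, the stand-in encoders, and all dimensions are hypothetical placeholders, not taken from any specific paper discussed on this page.

```python
import torch
import torch.nn as nn


class GatedMultimodalFusion(nn.Module):
    """Fuse frozen unimodal encoders with a learned, input-dependent gate
    (a two-expert mixture over the image and text representations)."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 image_dim: int, text_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Freeze the pre-trained encoders; only the fusion head is trained.
        for enc in (self.image_encoder, self.text_encoder):
            for p in enc.parameters():
                p.requires_grad = False
        # Project each modality into a shared space.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Gate decides, per example, how much weight each modality receives.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, image_inputs, text_inputs):
        with torch.no_grad():  # encoders stay frozen
            img_feat = self.image_encoder(image_inputs)
            txt_feat = self.text_encoder(text_inputs)
        img_h = self.image_proj(img_feat)
        txt_h = self.text_proj(txt_feat)
        weights = self.gate(torch.cat([img_h, txt_h], dim=-1))  # (batch, 2)
        fused = weights[:, :1] * img_h + weights[:, 1:] * txt_h
        return self.classifier(fused)


if __name__ == "__main__":
    # Stand-in encoders; in practice these would be pre-trained backbones
    # (e.g., a vision model and a text model) returning pooled features.
    image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
    text_encoder = nn.Sequential(nn.Linear(128, 256))
    model = GatedMultimodalFusion(image_encoder, text_encoder,
                                  image_dim=512, text_dim=256,
                                  hidden_dim=128, num_classes=10)
    images = torch.randn(4, 3, 32, 32)
    text_feats = torch.randn(4, 128)
    logits = model(images, text_feats)
    print(logits.shape)  # torch.Size([4, 10])
```

Training only the projection, gate, and classifier keeps the trainable parameter count small, which is one reason this style of integration reduces data and compute requirements relative to training a multimodal model from scratch.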

Papers