Masked Multimodal Learning

Masked multimodal learning aims to improve the robustness and efficiency of models that process data from multiple sources (e.g., images, text, audio) by strategically masking parts of the input and training the model to predict the missing modalities. Current research focuses on novel architectures such as Mixture-of-Experts models and transformers, using techniques like masked autoencoding and cross-modal matching to strengthen feature representations and cross-modal alignment. By enabling models to handle incomplete or noisy data more gracefully, this approach improves generalization and efficiency across applications including autonomous driving, visual document understanding, and person re-identification.
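The core masking-and-predicting loop described above can be sketched in a few lines. The snippet below is a minimal illustration in NumPy, not the method of any specific paper: the `mask_modalities` and `reconstruction_loss` helpers, the choice of a zero vector as the mask token, and the rule of keeping at least one modality visible are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_modalities(features, mask_prob=0.5, rng=rng):
    """Randomly replace whole-modality feature vectors with a mask token.

    features: dict mapping modality name -> 1-D feature array.
    Returns (masked_features, mask) where mask[name] is True if that
    modality was hidden. A zero vector stands in for a learned mask
    token (an assumption for this sketch).
    """
    masked, mask = {}, {}
    for name, vec in features.items():
        hide = rng.random() < mask_prob
        mask[name] = hide
        masked[name] = np.zeros_like(vec) if hide else vec
    # Keep at least one modality visible so the model has some input.
    if all(mask.values()):
        keep = next(iter(features))
        mask[keep] = False
        masked[keep] = features[keep]
    return masked, mask

def reconstruction_loss(pred, target, mask):
    """Mean squared error computed only over the masked modalities,
    mirroring the masked-autoencoding objective: the model is scored
    on how well it predicts what was hidden."""
    losses = [np.mean((pred[n] - target[n]) ** 2) for n in target if mask[n]]
    return float(np.mean(losses)) if losses else 0.0
```

In a real training loop, the masked features would be fed to an encoder (e.g., a multimodal transformer), and the loss on the hidden modalities would drive the model to infer one modality from the others, which is what yields robustness to missing or noisy inputs at test time.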

Papers