Multimodal Variational AutoEncoders

Multimodal Variational Autoencoders (VAEs) are generative models designed to learn joint representations from data spanning multiple modalities (e.g., images, text, sensor readings). Current research emphasizes better handling of complex inter-modal relationships, often through advanced architectures such as Markov Random Fields, or by incorporating contrastive learning and normalizing flows to enhance generative capability and to disentangle shared from modality-private latent factors. These advances are driving progress in diverse applications, including cross-modal retrieval, robotic manipulation, and medical diagnosis, by enabling more effective data integration and improved model interpretability.
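One common way such models form a joint representation is to fuse the per-modality approximate posteriors into a single latent distribution. As a minimal sketch (assuming a product-of-experts fusion over diagonal Gaussians, one popular choice; the function name `poe_fuse` and the inclusion of a standard-normal prior expert are illustrative, not from the text above):

```python
import numpy as np

def poe_fuse(mus, logvars):
    """Fuse per-modality Gaussian posteriors q_m(z|x_m) via a product of experts.

    mus, logvars: arrays of shape (n_modalities, latent_dim).
    A standard-normal prior expert N(0, I) is included, so the result is
    well-defined even when some modalities are missing.
    Returns (joint_mu, joint_logvar) of shape (latent_dim,).
    """
    # Prepend the prior expert: mu = 0, logvar = 0 (i.e., unit variance).
    mus = np.concatenate([np.zeros_like(mus[:1]), mus], axis=0)
    logvars = np.concatenate([np.zeros_like(logvars[:1]), logvars], axis=0)

    precisions = np.exp(-logvars)              # 1 / sigma^2 for each expert
    joint_var = 1.0 / precisions.sum(axis=0)   # combined precision is the sum
    joint_mu = joint_var * (mus * precisions).sum(axis=0)  # precision-weighted mean
    return joint_mu, np.log(joint_var)

# Two modalities, 4-dimensional latent space.
mus = np.array([[1.0, 0.5, -0.2, 0.0],
                [0.8, 0.4,  0.1, 0.3]])
logvars = np.array([[0.0, -1.0, 0.5, 0.0],
                    [0.2,  0.0, 0.0, -0.5]])
joint_mu, joint_logvar = poe_fuse(mus, logvars)
```

A practical property of this fusion is that the joint posterior is at least as confident as any single expert: each modality's precision adds to the total, so the combined variance never exceeds the smallest per-modality variance.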

Papers