Multimodal VAE

Multimodal Variational Autoencoders (VAEs) aim to learn a joint latent representation of data from multiple modalities (e.g., images and text), enabling tasks such as cross-modal generation and imputation of missing modalities. Current research focuses on improving the quality and coherence of generated outputs by addressing limitations of existing architectures, for example by using diffusion decoders for complex modalities or by incorporating Markov Random Fields to better capture inter-modal relationships. These advances matter because they improve the ability of VAEs to model complex, real-world datasets, with applications ranging from neuroscience to medical image synthesis.
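
To make the joint-representation idea concrete, the sketch below shows one common construction: each modality gets its own Gaussian encoder, the per-modality posteriors are fused with a product of experts (PoE), and each modality is decoded from the shared latent code. This is a minimal PyTorch sketch under assumed dimensions; `ModalityEncoder`, `poe_fuse`, and `MultimodalVAE` are hypothetical names, and PoE is just one standard fusion scheme. It does not implement the diffusion-decoder or Markov Random Field variants mentioned above.

```python
# Minimal sketch of a two-modality VAE with product-of-experts (PoE)
# posterior fusion. All names and dimensions are illustrative, not taken
# from any specific paper.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Maps one modality to Gaussian posterior parameters (mu, logvar)."""

    def __init__(self, in_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)


def poe_fuse(mus, logvars):
    """Product of Gaussian experts, including a standard-normal prior expert."""
    # Each expert contributes precision 1/var; the prior expert has mu=0, var=1.
    precisions = [torch.ones_like(mus[0])] + [torch.exp(-lv) for lv in logvars]
    weighted_mus = [torch.zeros_like(mus[0])] + [
        m * p for m, p in zip(mus, precisions[1:])
    ]
    total_precision = sum(precisions)
    mu = sum(weighted_mus) / total_precision
    logvar = -torch.log(total_precision)
    return mu, logvar


class MultimodalVAE(nn.Module):
    def __init__(self, dims=(784, 32), latent_dim=20):
        super().__init__()
        self.encoders = nn.ModuleList(ModalityEncoder(d, latent_dim) for d in dims)
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, d))
            for d in dims
        )

    def forward(self, xs):
        # xs: list of tensors, with None for a missing modality
        # (at least one modality must be observed).
        mus, logvars = [], []
        for enc, x in zip(self.encoders, xs):
            if x is not None:  # skipping absent modalities enables imputation
                mu, lv = enc(x)
                mus.append(mu)
                logvars.append(lv)
        mu, logvar = poe_fuse(mus, logvars)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        recons = [dec(z) for dec in self.decoders]  # decode every modality
        # KL of the fused diagonal-Gaussian posterior against N(0, I);
        # the per-modality reconstruction losses (e.g., MSE/BCE) are omitted.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return recons, kl
```

Passing `[x, None]` encodes only the observed modality; decoding the shared `z` through both decoders then reconstructs the observed input and imputes the missing one, which is the kind of cross-modal generation and imputation described above.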

Papers