Multimodal Generative Model

Multimodal generative models aim to learn coherent joint representations and to generate data across multiple modalities (e.g., text, images, audio) by capturing the relationships between them. Current research emphasizes improving the expressiveness of these models, often through energy-based priors or by combining contrastive and reconstruction objectives within architectures such as transformers and variational autoencoders. The field is significant for advancing artificial intelligence: it enables applications such as improved image captioning, radiology report generation, and more robust and efficient path planning in robotics, while also exposing and helping mitigate biases present in the training data.
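To make the "contrastive plus reconstruction" idea concrete, below is a minimal PyTorch sketch of a two-modality VAE whose training loss mixes per-modality reconstruction, a KL term, and a symmetric InfoNCE term that aligns paired image/text latents. All names (`MultimodalVAE`, `info_nce`, `loss_fn`), the toy dimensions, and the loss weights are illustrative assumptions, not taken from any specific paper surveyed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalVAE(nn.Module):
    """Toy two-modality VAE: each modality has its own encoder/decoder,
    sharing a common latent space (illustrative architecture)."""

    def __init__(self, img_dim=784, txt_dim=256, latent_dim=32):
        super().__init__()
        # Per-modality encoders producing Gaussian posterior parameters (mu, logvar).
        self.img_enc = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, 2 * latent_dim))
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, 128), nn.ReLU(), nn.Linear(128, 2 * latent_dim))
        # Per-modality decoders reconstructing from the shared latent space.
        self.img_dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim))
        self.txt_dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, txt_dim))

    @staticmethod
    def reparameterize(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def forward(self, img, txt):
        z_i, mu_i, lv_i = self.reparameterize(self.img_enc(img))
        z_t, mu_t, lv_t = self.reparameterize(self.txt_enc(txt))
        return {
            "img_recon": self.img_dec(z_i),
            "txt_recon": self.txt_dec(z_t),
            "mu": (mu_i, mu_t),
            "logvar": (lv_i, lv_t),
        }


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: paired (image, text) latents are positives,
    all other pairs in the batch serve as negatives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def loss_fn(model, img, txt, beta=1.0, gamma=0.5):
    """Combined objective: reconstruction + beta * KL + gamma * contrastive."""
    out = model(img, txt)
    recon = F.mse_loss(out["img_recon"], img) + F.mse_loss(out["txt_recon"], txt)
    kl = sum(
        -0.5 * torch.mean(1 + lv - mu.pow(2) - lv.exp())
        for mu, lv in zip(out["mu"], out["logvar"])
    )
    contrastive = info_nce(out["mu"][0], out["mu"][1])
    return recon + beta * kl + gamma * contrastive


if __name__ == "__main__":
    model = MultimodalVAE()
    img = torch.randn(8, 784)   # stand-in for flattened image features
    txt = torch.randn(8, 256)   # stand-in for pooled text embeddings
    print(loss_fn(model, img, txt).item())
```

The reconstruction and KL terms give the usual VAE evidence lower bound per modality, while the contrastive term encourages the two modalities' posteriors for the same sample to land close together in latent space; the `beta` and `gamma` weights trading these off are hyperparameters that vary across the literature.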

Papers