Multimodal Generative Model
Multimodal generative models aim to create coherent representations and generate data across multiple modalities (e.g., text, images, audio) by learning the relationships between them. Current research emphasizes improving the expressiveness of these models, often by using energy-based priors or by combining contrastive and reconstruction learning objectives within architectures such as transformers and variational autoencoders. The field is significant for advancing artificial intelligence: it enables applications such as improved image captioning, radiology report generation, and more robust and efficient path planning in robotics, while also surfacing and mitigating biases present in the training data.
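To make the combination of contrastive and reconstruction objectives concrete, the sketch below shows a minimal two-modality variational autoencoder in PyTorch, trained with a per-modality reconstruction/KL loss plus an InfoNCE-style contrastive term that aligns the latent codes of paired samples. This is an illustrative sketch only, not the method of any paper listed below; the module names, feature dimensions, and loss weights (beta, lam) are assumptions chosen for readability.

```python
# Minimal sketch (illustrative assumptions throughout): two modality-specific
# VAEs whose latents are additionally aligned with an InfoNCE contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityVAE(nn.Module):
    """Encoder/decoder pair for one modality (e.g., image or text features)."""

    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar, z


def info_nce(z_a, z_b, temperature: float = 0.07):
    """Contrastive alignment: matching cross-modal pairs in the batch are positives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric loss over both retrieval directions (a->b and b->a).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def multimodal_loss(vae_img, vae_txt, x_img, x_txt, beta=1e-3, lam=0.5):
    """Per-modality reconstruction + KL, plus a cross-modal contrastive term."""
    losses, latents = [], []
    for vae, x in ((vae_img, x_img), (vae_txt, x_txt)):
        recon, mu, logvar, z = vae(x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        losses.append(F.mse_loss(recon, x) + beta * kl)
        latents.append(z)
    return sum(losses) + lam * info_nce(*latents)


# Example usage with random stand-in features for a batch of 32 paired samples.
vae_img, vae_txt = ModalityVAE(512, 64), ModalityVAE(300, 64)
x_img, x_txt = torch.randn(32, 512), torch.randn(32, 300)
loss = multimodal_loss(vae_img, vae_txt, x_img, x_txt)
loss.backward()
```

The contrastive term pulls paired latents together (a CLIP-style objective), while the reconstruction/KL terms keep each latent informative enough to regenerate its own modality; real systems typically replace the toy MLP encoders with transformer backbones.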
Papers
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie
FastRM: An efficient and automatic explainability framework for multimodal generative models
Gabriela Ben-Melech Stan, Estelle Aflalo, Man Luo, Shachar Rosenman, Tiep Le, Sayak Paul, Shao-Yen Tseng, Vasudev Lal
Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction
Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
On the Fairness, Diversity and Reliability of Text-to-Image Generative Models
Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian