Multimodal Image Generation

Multimodal image generation aims to create realistic images from diverse inputs such as text, sketches, and other images, offering greater control and creative flexibility than unimodal methods. Current research focuses on combining the strengths of autoregressive and diffusion models, often incorporating large language models for better context understanding and reusing pre-trained models to reduce training cost. The field matters for its potential applications in creative industries such as fashion design and art, as well as for advancing computer vision tasks such as person re-identification and image fusion.

Papers