Compositional Text-to-Image

Compositional text-to-image generation aims to create images that accurately reflect complex textual descriptions involving multiple objects, attributes, and relationships. Current research focuses on improving the controllability and accuracy of these models, often employing diffusion models and large language models (LLMs) to guide the generation process. Issues such as attribute misalignment and object omission are addressed through techniques like attention map manipulation and scene decomposition. These advances both deepen the fundamental understanding of multimodal generation and enable applications that require precise visual representations of nuanced textual inputs.
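To make the attention-map idea concrete, the sketch below shows one common form of it: from a diffusion model's cross-attention maps, compute a loss that penalizes object tokens receiving weak activations anywhere in the image (in practice, the gradient of this loss is used to update the latents during sampling, in the style of Attend-and-Excite). This is a minimal NumPy illustration, not any specific paper's implementation; the map shape `(num_tokens, H, W)` and the helper names are assumptions.

```python
import numpy as np

def attention_loss(attn_maps, object_token_ids):
    """Penalize object tokens whose cross-attention map has no strong peak.

    attn_maps: array of shape (num_tokens, H, W) with activations in [0, 1]
               (hypothetical layout; real models average maps across heads
               and layers before this step).
    object_token_ids: indices of the tokens naming the objects to keep.
    Returns the worst case over object tokens of (1 - max activation),
    so the loss is high whenever some object is at risk of being omitted.
    """
    per_token = [1.0 - attn_maps[t].max() for t in object_token_ids]
    return max(per_token)

def smooth(attn_map, k=3):
    """Box-filter an (H, W) attention map so an isolated speckle does not
    count as a strong activation (a common preprocessing step)."""
    pad = k // 2
    padded = np.pad(attn_map, pad, mode="edge")
    out = np.empty_like(attn_map)
    for i in range(attn_map.shape[0]):
        for j in range(attn_map.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

# Toy example: token 1 attends strongly somewhere, token 3 only weakly,
# so the loss is dominated by the under-attended token 3.
maps = np.zeros((4, 8, 8))
maps[1, 2, 2] = 0.9
maps[3, 5, 5] = 0.4
loss = attention_loss(maps, object_token_ids=[1, 3])  # → 0.6
```

In a full pipeline this scalar would be differentiated with respect to the noisy latents and a gradient step taken at selected denoising timesteps, steering generation toward images where every named object is actually attended to.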

Papers