Image-Text Alignment
Image-text alignment focuses on improving the correspondence between visual and textual representations, aiming for models that accurately understand and generate images from textual descriptions, or vice versa. Current research emphasizes strengthening this alignment across model architectures such as diffusion transformers and vision-language models, often via contrastive learning, attention modulation, and fine-tuning strategies that leverage large language models or image-to-text concept matching. This work is crucial for applications such as text-to-image generation, image captioning, and weakly supervised semantic segmentation, ultimately yielding more robust and interpretable multimodal AI systems.
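As a concrete illustration of the contrastive-learning technique mentioned above, the sketch below implements a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings. Matching pairs lie on the diagonal of the similarity matrix, and cross-entropy is applied in both directions. The function name, temperature value, and toy embeddings are illustrative assumptions, not drawn from any of the listed papers.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: arrays of shape (N, D), row i of each is a matching pair.
    """
    # L2-normalize so the dot product becomes cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (N, N) scaled similarity matrix
    n = logits.shape[0]
    targets = np.arange(n)                  # matching pairs sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()    # pick diagonal entries

    # Average of image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, mutually orthogonal embeddings the loss approaches zero, while unrelated random embeddings yield a noticeably higher value; training a model to minimize this loss is what pulls matching image and text representations together.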
Papers
Bringing Multimodality to Amazon Visual Search System
Xinliang Zhu, Michael Huang, Han Ding, Jinyu Yang, Kelvin Chen, Tao Zhou, Tal Neiman, Ouye Xie, Son Tran, Benjamin Yao, Doug Gray, Anuj Bindal, Arnab Dhua
DoPTA: Improving Document Layout Analysis using Patch-Text Alignment
Nikitha SR, Tarun Ram Menta, Mausoom Sarkar
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han
Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models
Yongjin Yang, Sihyeon Kim, Hojung Jung, Sangmin Bae, SangMook Kim, Se-Young Yun, Kimin Lee