Image-Text Alignment
Image-text alignment focuses on improving the correspondence between visual and textual representations, with the goal of building models that accurately understand and generate images from textual descriptions, or vice versa. Current research emphasizes strengthening alignment across a range of model architectures, including diffusion transformers and vision-language models, often through techniques such as contrastive learning, attention modulation, and fine-tuning strategies that leverage large language models or image-to-text concept matching. This work is crucial for advancing applications such as text-to-image generation, image captioning, and weakly supervised semantic segmentation, ultimately leading to more robust and interpretable multimodal AI systems.
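Contrastive learning, one of the techniques mentioned above, is commonly implemented as a symmetric InfoNCE objective over paired image and text embeddings (as popularized by CLIP). The sketch below is a minimal NumPy illustration of that general idea, not code from any of the listed papers; the function name, temperature value, and batch shapes are illustrative assumptions.

```python
import numpy as np

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays where row i of each
    is a matched image-text pair. Illustrative sketch only.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each image embedding toward its paired caption embedding while pushing it away from the other captions in the batch, which is the core mechanism behind many alignment objectives in this area.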
Papers
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han
Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models
Yongjin Yang, Sihyeon Kim, Hojung Jung, Sangmin Bae, SangMook Kim, Se-Young Yun, Kimin Lee
Unlocking Intrinsic Fairness in Stable Diffusion
Eunji Kim, Siwon Kim, Rahim Entezari, Sungroh Yoon
OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion
Guoting Wei, Xia Yuan, Yu Liu, Zhenhao Shang, Kelu Yao, Chao Li, Qingsen Yan, Chunxia Zhao, Haokui Zhang, Rong Xiao