Image-Text Pairs
Image-text pairs are fundamental to training multimodal models that both understand and generate visual and textual information. Current research focuses on improving the alignment between image and text representations, often employing contrastive learning, multi-graph alignment, and various attention mechanisms within transformer-based architectures. These advances aim to address challenges such as data scarcity, limited compositional understanding, and sensitivity to noise and adversarial attacks, ultimately leading to more accurate and efficient vision-language models. The resulting improvements have significant implications for applications including image retrieval, text-to-image generation, and medical image analysis.
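The contrastive objective mentioned above can be illustrated with a minimal NumPy sketch of a CLIP-style symmetric loss: matched image-text pairs sit on the diagonal of a similarity matrix, and cross-entropy is applied in both directions. Function names, the temperature value, and the embedding dimensions are illustrative, not taken from any specific paper listed below.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings (sketch)."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity between image i and text j
    logits = image_emb @ text_emb.T / temperature

    # Matched pairs lie on the diagonal
    n = logits.shape[0]
    labels = np.arange(n)

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 64))
aligned_texts = images + 0.1 * rng.normal(size=(8, 64))   # correlated with images
random_texts = rng.normal(size=(8, 64))                   # uncorrelated baseline

aligned_loss = clip_contrastive_loss(images, aligned_texts)
random_loss = clip_contrastive_loss(images, random_texts)
```

Minimizing this loss pulls each image embedding toward its paired text embedding and pushes it away from the other texts in the batch; aligned pairs should therefore yield a much lower loss than random pairings.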
Papers
Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness
Liangliang Cao, Bowen Zhang, Chen Chen, Yinfei Yang, Xianzhi Du, Wencong Zhang, Zhiyun Lu, Yantao Zheng
Scene Text Recognition with Image-Text Matching-guided Dictionary
Jiajun Wei, Hongjian Zhan, Xiao Tu, Yue Lu, Umapada Pal
Three ways to improve feature alignment for open vocabulary detection
Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Baldridge, Jiahui Yu