Image-Text Pairs
Image-text pairs are fundamental to training multimodal models that understand and generate both visual and textual information. Current research focuses on improving the alignment between image and text representations, often through contrastive learning, multi-graph alignment, and attention mechanisms within transformer-based architectures. These advances aim to address challenges such as data scarcity, compositional understanding, and robustness to noise and adversarial attacks, ultimately yielding more accurate and efficient vision-language models. The resulting gains matter for applications including image retrieval, text-to-image generation, and medical image analysis.
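As a rough illustration of the contrastive alignment mentioned above, the sketch below shows a CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings in PyTorch. It is a minimal sketch under stated assumptions, not the method of any paper listed here; the function name, temperature, batch size, and embedding dimension are illustrative choices.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits at index i (and vice versa).
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy batch of 8 pairs with 512-dim embeddings from hypothetical encoders.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_style_contrastive_loss(img, txt))
```

In practice the two embedding matrices would come from an image encoder and a text encoder trained jointly, and minimizing this loss pulls matched pairs together while pushing apart mismatched ones within the batch.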
Papers
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
Ke Wang, Hong Xuan
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
Zejian Li, Chenye Meng, Yize Li, Ling Yang, Shengyuan Zhang, Jiarui Ma, Jiayi Li, Guang Yang, Changyuan Yang, Zhiyuan Yang, Jinxiong Chang, Lingyun Sun
Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification
Neng Dong, Shuanglin Yan, Liyan Zhang, Jinhui Tang
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, Josef Sivic
MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
Sen Xing, Muyan Zhong, Zeqiang Lai, Liangchen Li, Jiawen Liu, Yaohui Wang, Jifeng Dai, Wenhai Wang