Image Text Pair

Image-text pairs are the basic training unit for multimodal models that understand and generate both visual and textual information. Current research focuses on improving the alignment between image and text representations, typically with contrastive learning, multi-graph alignment, and attention mechanisms in transformer-based architectures. This work targets data scarcity, weak compositional understanding, and sensitivity to noisy data and adversarial attacks, with the goal of more accurate and efficient vision-language models. Applications include image retrieval, text-to-image generation, and medical image analysis.
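To make the contrastive-learning idea concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of image-text pairs. It is not taken from any of the papers listed here. The function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are illustrative assumptions, and real systems would compute embeddings with learned encoders and a tensor library. The sketch only shows the objective: each image should score highest against its own caption, and each caption against its own image.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image contrastive losses.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature value is an illustrative assumption.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


# Toy batch: two images and two captions (hypothetical embeddings).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # caption i matches image i
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # captions assigned to the wrong images

loss_aligned = symmetric_contrastive_loss(image_embs, aligned_text)
loss_swapped = symmetric_contrastive_loss(image_embs, swapped_text)
print(loss_aligned < loss_swapped)
```

In this toy batch the correctly paired captions give a lower loss than the swapped ones, which is the signal a model trained this way would follow to pull matching images and texts together in the embedding space.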

Papers

July 10, 2024

Tuning Vision-Language Models with Candidate Labels by Prompt Alignment
Zhifang Zhang, Yuwei Niu, Xin Liu, Beibei Li
Vision Language Model Image Text Pair High Quality Representation Candidate Label Label Disambiguation Prompt Alignment

July 6, 2024

A Study of Test-time Contrastive Concepts for Open-world, Open-vocabulary Semantic Segmentation
Monika Wysoczańska, Antonin Vobecky, Amaia Cardiel, Tomasz Trzciński, Renaud Marlet, Andrei Bursuc, Oriane Siméoni
Study Feature Open World Image Text Pair Natural Language Query Open Vocabulary Semantic Segmentation Image Region Text Contrastive Learning

June 27, 2024

Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs
Huaying Zhang, Rintaro Yanagi, Ren Togo, Takahiro Ogawa, Miki Haseyama
Image Text Pair Textual Inversion Zero Shot Composed Image Retrieval

June 26, 2024

MATE: Meet At The Embedding -- Connecting Images with Long Texts
Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim
Vision Language Model Jina Embeddings Image Text Pair Long Text LLM Embeddings Cross Modal Retrieval Benchmark

June 16, 2024

Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models
Yikai Zhang, Qianyu He, Xintao Wang, Siyu Yuan, Jiaqing Liang, Yanghua Xiao
Vision Language Model Image Text Pair Quantum Shadow Light Work Multi Modal Knowledge Graph Tail Entity

June 12, 2024

June 9, 2024

Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval
Yiwei Ma, Xiaoshuai Sun, Jiayi Ji, Guannan Jiang, Weilin Zhuang, Rongrong Ji
Alignment Problem Image Text Pair Image Text Retrieval Rhythm Game Many to Many Text Based Person Text Based Person Retrieval

June 3, 2024

MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization
Yu Zhang, Qi Zhang, Zixuan Gong, Yiwei Shi, Yepeng Liu, Duoqian Miao, Yang Liu, Ke Liu, Kun Yi, Wei Fan, Liang Hu, Changwei Wang
Contrastive Language Image Image Text Pair Token Level Multi View Image Contrastive Self Supervision Language Image Pre Training

May 29, 2024

May 25, 2024

Active Learning for Finely-Categorized Image-Text Retrieval by Selecting Hard Negative Unpaired Samples
Dae Ung Jo, Kyuewang Lee, JaeHo Chung, Jin Young Choi
Active Learning Image Text Pair Negative Sampling Retrieval Model COCO Dataset Image Text Retrieval Active Learning Algorithm

May 23, 2024

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai
Language Model Multimodal Large Language Model Image Text Pair Multi Modal Large Language Model Cross Modal Alignment LD Align Alignment Performance

May 22, 2024

No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Angéline Pouget, Lucas Beyer, Emanuele Bugliarello, Xiao Wang, Andreas Peter Steiner, Xiaohua Zhai, Ibrahim Alabdulmohsin
Image Text Pair Diverse Datasets Contrastive Vision Language Local Culture Multimodal System Social Class

April 18, 2024

April 17, 2024

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene
Wenbo Zhang, Yifan Zhang, Jianfeng Lin, Binqiang Huang, Jinlu Zhang, Wenhao Yu
Knowledge Distillation Alignment Problem Image Text Pair Vision Language Alignment Multilingual Scenario Progressive Alignment Multilingual Vision Multilingual CLIP

April 12, 2024

Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu, Tongkai Shi, Liqing Gao, Zekang Liu, Wei Feng
Large Vision Language Model Image Text Pair Image Modeling Continuous Sign Language Recognition Frame Wise

Additional papers (titles only)

What If We Recaption Billions of Web Images with LLaMA-3?

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval

Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design

Enhancing Vision-Language Model with Unmasked Token Alignment

Leveraging Many-To-Many Relationships for Defending Against Visual-Language Adversarial Attacks

Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

EdgeFusion: On-Device Text-to-Image Generation