CLIP Representation

CLIP representations, derived from large pre-trained vision-language models, are revolutionizing multimodal learning by aligning image and text embeddings. Current research focuses on improving CLIP's performance in various downstream tasks, including object detection, semantic segmentation, and deepfake detection, often through modifications to the model architecture or by incorporating collaborative vision-text optimization strategies. These advancements are significantly impacting fields like computer vision and natural language processing, enabling more robust and interpretable models for applications ranging from image captioning to multimodal classification. The development of methods to enhance CLIP's generalization capabilities, particularly in out-of-distribution scenarios and for compositional understanding, remains a key area of investigation.

Papers

September 23, 2024

MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification
Siddhant Bikram Shah, Shuvam Shiwakoti, Maheep Chaudhary, Haohan Wang
Internet Meme Text Image Multimodal Analysis CLIP Representation Multimodal Meme

September 12, 2024

DeCLIP: Decoding CLIP representations for deepfake localization
Stefan Smeu, Elisabeta Oneata, Dan Oneata
Generative Model Deepfake Detection Generative Approach CLIP Representation

August 1, 2024

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, Humphrey Shi
Vision Paper Open Vocabulary Semantic Segmentation Open Vocabulary Segmentation CLIP Vision Encoder Image Text Representation Text Feature CLIP Representation

July 17, 2024

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang
Semantic Segmentation Vision Language Model Open Vocabulary Semantic Segmentation Segmentation Quality CLIP Representation

July 8, 2024

Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models
Reza Abbasi, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Integral Role Distribution Generalization Compositional Generalization Image Representation CLIP Model Compositional Structure CLIP Training Representation Disentanglement CLIP Representation

February 16, 2024

Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P. Calmon, Himabindu Lakkaraju
Single CLIP CLIP Embeddings Concept Embeddings CLIP Representation

April 26, 2023

From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping
Junyang Wang, Ming Yan, Yi Zhang, Jitao Sang
Faithful Generation Cross Modal Image Captioning Image Caption Video Captioning Modality Gap Association Capability CLIP Representation

March 16, 2023

GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning
Jiayi Lin, Shaogang Gong
Grid Representation CLIP Representation Stage Object Detection

February 23, 2023

Controlled and Conditional Text to Image Generation with Diffusion Prior
Pranav Aggarwal, Hareesh Ravi, Naveen Marri, Sachin Kelkar, Fengbin Chen, Vinh Khuc, Midhun Harikumar, Ritiz Tambi, Sudharshan Reddy Kakumanu, Purvak Lapsiya, Alvin Ghouas, Sarah Saber, Malavika Ramprasad, Baldo Faieta, Ajinkya Kale
Image Generation Diffusion Prior Denoising Diffusion Model Conditional Reasoning Diffusion Decoder CLIP Representation

December 6, 2022

Fine-tuned CLIP Models are Efficient Video Learners
Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan
Single CLIP CLIP Model CLIP Representation Efficient Video

November 17, 2022

CAE v2: Context Autoencoder with CLIP Target
Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
Single CLIP Image Modeling Masked Image Modeling CLIP Level CLIP Representation Mask Wearing Ratio Context Autoencoder