Paper ID: 2405.15289

Learning Invariant Causal Mechanism from Vision-Language Models

Zeen Song, Siyu Zhao, Xingyu Zhang, Jiangmeng Li, Changwen Zheng, Wenwen Qiang

Large-scale pre-trained vision-language models such as CLIP have been widely applied to a variety of downstream scenarios. In real-world applications, the CLIP model is often utilized in more diverse scenarios than those encountered during its training, a challenge known as the out-of-distribution (OOD) problem. However, our experiments reveal that CLIP performs unsatisfactorily in certain domains. Through a causal analysis, we find that CLIP's current prediction process cannot guarantee a low OOD risk. The lowest OOD risk can be achieved when the prediction process is based on invariant causal mechanisms, i.e., predicting solely based on invariant latent factors. However, theoretical analysis indicates that CLIP does not identify these invariant latent factors. Therefore, we propose the Invariant Causal Mechanism for CLIP (CLIP-ICM), a framework that first identifies invariant latent factors using interventional data and then performs invariant predictions across various domains. Our method is simple yet effective, without significant computational overhead. Experimental results demonstrate that CLIP-ICM significantly improves CLIP's performance in OOD scenarios.

Submitted: May 24, 2024