CLIP Score

CLIP score is a metric that measures how well an image and a text match, typically computed as the cosine similarity between the image and text embeddings produced by a Contrastive Language-Image Pre-training (CLIP) model. Current research focuses on making CLIP score more effective for selecting training data for larger vision-language models, mitigating biases such as over-reliance on text that appears within images, and adapting it to downstream tasks such as object counting and video quality assessment. These efforts aim to make CLIP-based models more robust and reliable, improving performance in applications including image retrieval, caption generation, and robotic perception.
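To make the metric concrete, here is a minimal sketch of one widely used formulation (often written CLIPScore), which rescales the cosine similarity between an image embedding and a text embedding by a constant w = 2.5 and clips negative values to zero. The function name `clip_score` and the toy 3-dimensional vectors are illustrative assumptions; in practice the embeddings would come from a CLIP image encoder and text encoder, which this sketch does not include.

```python
import numpy as np


def clip_score(image_emb, text_emb, w=2.5):
    """Rescaled, clipped cosine similarity between one image and one text embedding.

    Assumes both inputs are 1-D, non-zero vectors of the same length.
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    # Cosine similarity: dot product of the two vectors divided by their norms.
    cos = float(image_emb @ text_emb
                / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    # Negative similarities are clipped to zero before rescaling.
    return w * max(cos, 0.0)


# Toy vectors standing in for real CLIP embeddings (hypothetical values).
image_vec = np.array([1.0, 0.0, 0.0])
caption_a = np.array([1.0, 0.0, 0.0])   # same direction as the image
caption_b = np.array([0.0, 1.0, 0.0])   # orthogonal to the image
caption_c = np.array([-1.0, 0.0, 0.0])  # opposite direction

scores = [clip_score(image_vec, c) for c in (caption_a, caption_b, caption_c)]
print(scores)
```

In a data-selection setting, a score like this would be computed for each image-text pair and pairs below some threshold discarded; the threshold itself is a tuning choice and is not specified here.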

Papers