CLIP Update
Recent research significantly expands the capabilities and applications of CLIP (Contrastive Language–Image Pre-training), a foundational vision-language model. Current efforts use CLIP as a source of preference signals to mitigate hallucinations in large vision-language models (LVLMs), and improve CLIP's generalization across diverse tasks such as deepfake detection, medical image analysis (e.g., diabetic retinopathy grading), and few-shot out-of-distribution detection. These advances rely on techniques such as preference optimization, prompt engineering, and architectural modifications to CLIP, improving both accuracy and explainability, with implications for healthcare, social media analysis, and computer vision.
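As a rough illustration of the preference-optimization direction, the sketch below uses off-the-shelf CLIP similarity scores to rank two candidate LVLM captions for an image and package them as a chosen/rejected pair for DPO-style training. The checkpoint name, captions, and pairing logic are illustrative assumptions, not the CLIP-DPO authors' actual pipeline.

```python
# Minimal sketch (illustrative assumptions, not the CLIP-DPO pipeline):
# rank two candidate LVLM captions by CLIP image-text similarity and
# package them as a chosen/rejected pair for DPO-style preference tuning.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # any public CLIP checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Placeholder image; in practice this is the image the LVLM described.
image = Image.new("RGB", (224, 224), color="white")

# Two candidate captions sampled from an LVLM; the longer one is assumed
# to contain a hallucinated detail for the sake of the example.
captions = [
    "A dog sitting on a couch.",
    "A dog sitting on a couch next to a cat wearing a hat.",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image[0]  # one similarity score per caption

# Higher CLIP similarity -> chosen response; lower -> rejected.
preference_pair = {
    "chosen": captions[int(scores.argmax())],
    "rejected": captions[int(scores.argmin())],
}
print(preference_pair)
```

In a full pipeline, pairs built this way would feed a standard DPO loss on the LVLM; the snippet only shows how CLIP similarity can supply the preference ranking.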
Papers
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection
Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, Yunchao Wei