CLIP Vision Encoder
The CLIP vision encoder is the image branch of the CLIP vision-language model: it maps images into a shared image-text embedding space, which enables tasks such as zero-shot image classification and open-vocabulary semantic segmentation. Current research focuses on improving its robustness against adversarial attacks and on strengthening open-vocabulary segmentation, through techniques such as Siamese adversarial fine-tuning, collaborative vision-text optimization, and the incorporation of multiple vision experts. These advances address robustness and vocabulary limitations of existing models and broaden the encoder's use in applications such as image generation, visual question answering, and multimodal large language models.
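As a concrete illustration of the zero-shot classification workflow described above, the sketch below uses the Hugging Face transformers implementation of CLIP. The checkpoint name, image path, and candidate labels are illustrative assumptions, not taken from any of the papers listed here.

```python
# Minimal zero-shot classification sketch with the CLIP vision encoder.
# Assumes the `transformers` and `Pillow` packages are installed; the
# checkpoint, image path, and labels below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("example.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# The processor tokenizes the label prompts and resizes/normalizes the image.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # logits_per_image holds image-text similarity scores; a softmax over the
    # candidate labels yields the zero-shot classification probabilities.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]

    # The vision encoder alone can also produce image embeddings, e.g. for
    # retrieval or as features feeding an open-vocabulary segmentation head.
    image_embedding = model.get_image_features(pixel_values=inputs["pixel_values"])

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
print("image embedding shape:", tuple(image_embedding.shape))
```

Because the label set is supplied at inference time as text prompts, the same encoder handles new categories without retraining, which is the property the adversarial-robustness and open-vocabulary work above seeks to preserve and extend.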
Papers
Eighteen papers, dated March 6, 2023 through September 30, 2024.