Zero-Shot Vision-Language
Zero-shot vision-language models aim to let computers understand and reason about images and text together without task-specific training. Current research focuses on improving these models by leveraging pre-trained unimodal (image-only or text-only) encoders, developing novel pre-training objectives such as image-caption correction, and exploring multi-teacher distillation to combine the strengths of different architectures. The field matters because it yields more efficient and robust AI systems for diverse real-world scenarios, with applications ranging from image captioning and visual question answering to semantic segmentation and object detection.
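The core zero-shot classification mechanism shared by CLIP-style models can be sketched as follows: embed the image and each candidate label's text prompt into a shared space, then pick the label whose embedding is most similar to the image's. The sketch below uses NumPy with hypothetical toy embeddings standing in for real encoder outputs; the function name, temperature value, and vectors are illustrative assumptions, not any specific model's API.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=0.07):
    """Pick the label whose text embedding is closest to the image embedding.

    image_emb: (d,) image encoder output (toy stand-in here)
    text_embs: (n, d) one text embedding per candidate label
    """
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Scale similarities by a temperature and apply a stable softmax
    logits = txt @ img / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Toy embeddings: the image vector points toward the "cat" text vector
image_emb = np.array([1.0, 0.0])
text_embs = np.array([[0.9, 0.1],   # prompt "a photo of a cat"
                      [0.1, 0.9]])  # prompt "a photo of a dog"
label, probs = zero_shot_classify(image_emb, text_embs, ["cat", "dog"])
print(label)  # → cat
```

In practice the embeddings come from pre-trained image and text encoders, which is why the approach transfers to new label sets with no extra training: only the text prompts change.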
Papers

(Eleven papers listed, dated January 5, 2023 through September 28, 2024; entry details not preserved.)