Vision Encoders
Vision encoders are core components of multimodal models: they transform images into numerical representations (embeddings) that a language model can consume. Current research focuses on improving these encoders, exploring architectures such as Vision Transformers (ViTs) and incorporating techniques such as knowledge distillation and multimodal contrastive learning to boost performance on tasks including image captioning, visual question answering, and object detection. These advances matter because the quality of the vision encoder directly bounds the capabilities of the larger vision-language model built on top of it, with downstream impact on applications ranging from autonomous driving to medical image analysis.
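The first step of a ViT-style encoder, turning an image into a sequence of token embeddings, can be sketched in a few lines. This is a minimal illustration with NumPy and randomly initialized (untrained) projection weights standing in for learned ones; `patch_embed` and its parameters are hypothetical names, not any specific library's API.

```python
import numpy as np

def patch_embed(image, patch_size=16, dim=64, rng=np.random.default_rng(0)):
    """Split an (H, W, C) image into non-overlapping patches and project
    each patch to a `dim`-dimensional token embedding (toy weights)."""
    H, W, C = image.shape
    n_h, n_w = H // patch_size, W // patch_size
    # Reshape into (num_patches, patch_size * patch_size * C) flattened patches.
    patches = (
        image[: n_h * patch_size, : n_w * patch_size]
        .reshape(n_h, patch_size, n_w, patch_size, C)
        .transpose(0, 2, 1, 3, 4)
        .reshape(n_h * n_w, -1)
    )
    # Toy linear projection; a real encoder learns this during training.
    W_proj = rng.standard_normal((patches.shape[1], dim)) * 0.02
    return patches @ W_proj  # shape: (num_tokens, dim)

img = np.zeros((224, 224, 3))       # a blank 224x224 RGB image
tokens = patch_embed(img)
print(tokens.shape)                 # (196, 64): 14x14 patches, 64-d each
```

In a full ViT, these patch tokens (plus positional information) are fed through transformer layers, and the resulting sequence is what the language model side of a multimodal system attends over.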