Vision-Language Fusion
Vision-language fusion integrates visual and textual information to improve understanding and reasoning in computer vision tasks. Current research relies heavily on transformer-based architectures, often building on pre-trained vision-language models such as CLIP and exploring techniques like prompt tuning and early/late fusion to combine image and text features effectively. These methods are advancing applications including object detection (especially open-vocabulary and aerial detection), visual grounding, and fine-grained image classification by enabling more robust, context-aware systems, and the resulting models show improved accuracy and efficiency over purely vision-based approaches.
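As a rough sketch of these ideas, the PyTorch snippet below contrasts late fusion (combining pooled features), early fusion via cross-attention, and CoOp-style prompt tuning, operating on features such as those a frozen CLIP encoder would produce. All module names, dimensions, and hyperparameters here are illustrative assumptions, not any specific paper's architecture.

```python
# Illustrative fusion strategies over pre-extracted vision/text embeddings
# (e.g., from a frozen CLIP backbone). Dimensions are assumptions.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Late fusion: encode modalities separately, then combine pooled features."""
    def __init__(self, img_dim=512, txt_dim=512, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_emb, txt_emb):
        # img_emb: (B, img_dim) pooled image features; txt_emb: (B, txt_dim)
        return self.head(torch.cat([img_emb, txt_emb], dim=-1))

class EarlyFusionBlock(nn.Module):
    """Early fusion: let token-level features interact via cross-attention."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens attend to text tokens (queries = image, keys/values = text).
        attended, _ = self.cross_attn(img_tokens, txt_tokens, txt_tokens)
        return self.norm(img_tokens + attended)

class PromptTuner(nn.Module):
    """Prompt tuning: learn a few context vectors prepended to the text tokens,
    keeping the backbone frozen (CoOp-style; purely illustrative)."""
    def __init__(self, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, txt_tokens):
        # txt_tokens: (B, T, dim) -> (B, n_ctx + T, dim)
        ctx = self.ctx.unsqueeze(0).expand(txt_tokens.size(0), -1, -1)
        return torch.cat([ctx, txt_tokens], dim=1)

# Usage with random stand-ins for CLIP features:
B = 4
logits = LateFusionClassifier()(torch.randn(B, 512), torch.randn(B, 512))  # (4, 10)
fused = EarlyFusionBlock()(torch.randn(B, 49, 512), torch.randn(B, 16, 512))  # (4, 49, 512)
prompted = PromptTuner()(torch.randn(B, 16, 512))  # (4, 20, 512)
```

The fusion point is largely a compute/expressiveness trade-off: late fusion is cheap but only lets pooled features interact, while early cross-attention allows fine-grained token-level interaction at the cost of extra attention layers.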