Pre-Trained Vision-Language Models
Pre-trained vision-language models (VLMs) integrate visual and textual information, aiming to improve multimodal understanding and enable zero-shot or few-shot learning across diverse tasks. Current research focuses on enhancing VLMs' compositional reasoning, adapting them to specialized domains (e.g., agriculture, healthcare), and improving efficiency through quantization and parameter-efficient fine-tuning techniques like prompt learning and adapter modules. These advancements are significant because they enable more robust and efficient applications of VLMs in various fields, ranging from robotics and medical image analysis to open-vocabulary object detection and long-tailed image classification.
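As a concrete illustration of the parameter-efficient fine-tuning mentioned above, the sketch below trains a small residual adapter on top of a frozen CLIP backbone (in the spirit of CLIP-Adapter) for few-shot classification. It is a minimal sketch, not the method of any paper listed here: it assumes the Hugging Face `transformers` library and the public "openai/clip-vit-base-patch32" checkpoint, and the class names, residual ratio, and `training_step` helper are illustrative placeholders.

```python
# Minimal sketch of adapter-style parameter-efficient fine-tuning on top of a
# frozen CLIP backbone (CLIP-Adapter-like). Assumes the Hugging Face
# `transformers` library; class names and hyperparameters are placeholders.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip.requires_grad_(False)  # keep the pre-trained VLM frozen
clip.eval()

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cargo ship", "fishing boat", "passenger ferry"]  # hypothetical classes
prompts = [f"a photo of a {c}." for c in class_names]

with torch.no_grad():
    text_inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(device)
    text_feats = clip.get_text_features(**text_inputs)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)


class Adapter(nn.Module):
    """Small residual bottleneck MLP applied to frozen image features."""

    def __init__(self, dim: int, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Blend adapted features with the original frozen features.
        return self.ratio * self.net(x) + (1 - self.ratio) * x


adapter = Adapter(clip.config.projection_dim).to(device)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)


def training_step(pixel_values: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """One few-shot training step; pixel_values come from a CLIPProcessor."""
    with torch.no_grad():
        img_feats = clip.get_image_features(pixel_values=pixel_values.to(device))
    img_feats = adapter(img_feats)  # only the adapter receives gradients
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    logits = clip.logit_scale.exp() * img_feats @ text_feats.t()
    loss = nn.functional.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

Only the adapter's small number of parameters is updated while the backbone stays frozen, which is what makes this style of adaptation far cheaper than full fine-tuning; prompt-learning approaches follow the same principle but optimize learnable context tokens in the text encoder instead.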
Papers
FairerCLIP: Debiasing CLIP's Zero-Shot Predictions using Functions in RKHSs
Sepehr Dehdashtian, Lan Wang, Vishnu Naresh Boddeti
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, Hongliang Ren
Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation
Zicheng Zhang, Tong Zhang, Yi Zhu, Jianzhuang Liu, Xiaodan Liang, Qixiang Ye, Wei Ke
Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification
Long Lan, Fengxiang Wang, Shuyan Li, Xiangtao Zheng, Zengmao Wang, Xinwang Liu
Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization
Kento Kawaharazuka, Naoaki Kanazawa, Yoshiki Obinata, Kei Okada, Masayuki Inaba