Vision-Language Foundation Models
Vision-language foundation models (VLMs) integrate visual and textual information to achieve robust multimodal understanding, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM performance on diverse downstream tasks through techniques such as prompt engineering, test-time adaptation, and parameter-efficient fine-tuning, often building on CLIP-style architectures and incorporating large language models. These advances are reshaping fields such as medical image analysis, autonomous driving, and robotics by enabling more accurate, efficient, and generalizable solutions to complex tasks.
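The zero-shot recipe shared by much of this work (including the detection and disease-identification papers below) is CLIP-style image-text matching: encode an image and a set of text prompts into a shared embedding space, then rank labels by similarity. The following is a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint name, class labels, and image path are illustrative placeholders, not details taken from any paper listed here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (name is illustrative; any
# CLIP-compatible checkpoint on the Hugging Face Hub would work).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prompt engineering in its simplest form: wrap each candidate
# class label in a hand-written text template.
labels = ["a motorcycle", "a helmet", "a fundus photograph"]  # placeholder classes
prompts = [f"a photo of {label}" for label in labels]

image = Image.open("example.jpg")  # placeholder image path

# Encode the image and all prompts, then score the image against
# each prompt; logits_per_image holds the image-text similarities.
inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)

# Report the zero-shot class probabilities.
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Prompt engineering, test-time adaptation, and parameter-efficient fine-tuning can all be read as variations on this pipeline: rewriting the templates, adapting the encoders or prompts at inference time, or updating a small subset of weights for a downstream domain.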
Papers
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge
Zihan Li, Diping Song, Zefeng Yang, Deming Wang, Fei Li, Xiulan Zhang, Paul E. Kinahan, Yu Qiao
Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets
Lucas Choi, Ross Greer
Towards Vision-Language Geo-Foundation Model: A Survey
Yue Zhou, Litong Feng, Yiping Ke, Xue Jiang, Junchi Yan, Xue Yang, Wayne Zhang
Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases
Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham, Dianbo Liu, Wendy Wong, Sahil Thakur, Beau Fenner, Danqi Fang, Siying Liu, Qingyun Liu, Yuqiang Huang, Hongqiang Zeng, Yanda Meng, Yukun Zhou, Zehua Jiang, Minghui Qiu, Changqing Zhang, Xinjian Chen, Sophia Y. Wang, Cecilia S. Lee, Lucia Sobrin, Carol Y. Cheung, Chi Pui Pang, Pearse A. Keane, Ching-Yu Cheng, Haoyu Chen, Huazhu Fu