Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform tasks that require both modalities, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes learnable textual or visual prompts for specific downstream tasks, and sparse token optimization, which reduces computational overhead by discarding redundant tokens. These advances enable VLMs to be applied to diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
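To make the prompt-tuning idea above concrete, the following is a minimal PyTorch sketch in the spirit of CoOp-style soft prompts: a small set of learnable context vectors is prepended to frozen class-name embeddings, and only those vectors are optimized against a frozen backbone. The `ToyTextEncoder`, tensor shapes, and hyperparameters are illustrative assumptions, not the implementation of any paper listed below.

```python
# Sketch of soft prompt tuning for a CLIP-style VLM (assumed setup, not a real CLIP API).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, CTX_LEN, NUM_CLASSES = 512, 4, 10  # illustrative sizes


class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen text encoder that maps token embeddings to one feature vector."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, EMBED_DIM)

    def forward(self, token_embeds):          # (C, L, D) -> (C, D)
        return self.proj(token_embeds.mean(dim=1))


class PromptTuner(nn.Module):
    def __init__(self, text_encoder, class_name_embeds):
        super().__init__()
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():   # keep the backbone frozen
            p.requires_grad_(False)
        # Learnable context vectors shared across classes (the "soft prompt").
        self.ctx = nn.Parameter(torch.randn(CTX_LEN, EMBED_DIM) * 0.02)
        # Fixed embeddings of the class-name tokens, shape (C, name_len, D).
        self.register_buffer("class_name_embeds", class_name_embeds)

    def forward(self, image_features):
        C = self.class_name_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)            # (C, CTX_LEN, D)
        prompts = torch.cat([ctx, self.class_name_embeds], dim=1)
        text_features = F.normalize(self.text_encoder(prompts), dim=-1)
        image_features = F.normalize(image_features, dim=-1)
        return image_features @ text_features.t()                # cosine-similarity logits


model = PromptTuner(ToyTextEncoder(), torch.randn(NUM_CLASSES, 8, EMBED_DIM))
optimizer = torch.optim.AdamW([model.ctx], lr=2e-3)              # only the prompt is trained

# Dummy batch: features from a frozen image encoder plus class labels.
image_features = torch.randn(32, EMBED_DIM)
labels = torch.randint(0, NUM_CLASSES, (32,))
loss = F.cross_entropy(model(image_features), labels)
loss.backward()
optimizer.step()
```

Because only the few context vectors receive gradients, adaptation of this kind is cheap relative to fine-tuning the full model, which is why prompt tuning is a common efficiency lever in the work listed below.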
Papers
Disease-informed Adaptation of Vision-Language Models
Jiajin Zhang, Ge Wang, Mannudeep K. Kalra, Pingkun Yan
Composed Image Retrieval for Remote Sensing
Bill Psomas, Ioannis Kakogeorgiou, Nikos Efthymiadis, Giorgos Tolias, Ondrej Chum, Yannis Avrithis, Konstantinos Karantzalos
Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding
Hanchen Tai, Qingdong He, Jiangning Zhang, Yijie Qian, Zhenyu Zhang, Xiaobin Hu, Xiangtai Li, Yabiao Wang, Yong Liu
Learning Invariant Causal Mechanism from Vision-Language Models
Zeen Song, Siyu Zhao, Xingyu Zhang, Jiangmeng Li, Changwen Zheng, Wenwen Qiang
How Culturally Aware are Vision-Language Models?
Olena Burda-Lassen, Aman Chadha, Shashank Goswami, Vinija Jain
CLIP model is an Efficient Online Lifelong Learner
Leyuan Wang, Liuyu Xiang, Yujie Wei, Yunlong Wang, Zhaofeng He
A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models
Mario Döbler, Robert A. Marsden, Tobias Raichle, Bin Yang
Pre-Trained Vision-Language Models as Partial Annotators
Qian-Wei Wang, Yuqiu Xie, Letian Zhang, Zimo Liu, Shu-Tao Xia
Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
Young Kyun Jang, Ser-nam Lim
Calibrated Self-Rewarding Vision Language Models
Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, Huaxiu Yao
AnomalyDINO: Boosting Patch-based Few-shot Anomaly Detection with DINOv2
Simon Damm, Mike Laszkiewicz, Johannes Lederer, Asja Fischer
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation
Se-eun Yoon, Hyunsik Jeon, Julian McAuley
Refining Skewed Perceptions in Vision-Language Models through Visual Representations
Haocheng Dai, Sarang Joshi
Safety Alignment for Vision Language Models
Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng
Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance
Kaifeng Zhang, Zhao-Heng Yin, Weirui Ye, Yang Gao
What Makes Good Few-shot Examples for Vision-Language Models?
Zhaojun Guo, Jinghui Lu, Xuejing Liu, Rui Zhao, ZhenXing Qian, Fei Tan
Adapting Multi-modal Large Language Model to Concept Drift From Pre-training Onwards
Xiaoyu Yang, Jie Lu, En Yu
More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models
Messi H. J. Lee, Jacob M. Montgomery, Calvin K. Lai