Vision Language Model
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, aiming to bridge the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead. These advances matter because they let VLMs be applied to diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
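To make the prompt-tuning idea concrete, below is a minimal, self-contained sketch in the spirit of CoOp-style soft prompt tuning for a CLIP-like model. The module names (`FrozenTextEncoder`, `PromptLearner`) and all dimensions are illustrative assumptions, not the method of any specific paper listed here; in practice the frozen towers would come from a pretrained VLM, and only the learnable context vectors are optimized.

```python
# Sketch of soft prompt tuning for a CLIP-style model (CoOp-like).
# The encoders below are stand-in modules; real use would load the frozen
# text/image towers of a pretrained VLM instead.
import torch
import torch.nn as nn

class FrozenTextEncoder(nn.Module):
    """Placeholder for a pretrained, frozen transformer text tower."""
    def __init__(self, embed_dim=512, width=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(width, embed_dim)
        for p in self.parameters():
            p.requires_grad_(False)  # backbone stays frozen during prompt tuning

    def forward(self, token_embeds):
        x = self.encoder(token_embeds)
        return self.proj(x[:, -1])  # take the last token as the text feature

class PromptLearner(nn.Module):
    """Learnable context vectors prepended to each class-name embedding."""
    def __init__(self, class_name_embeds, n_ctx=16, width=512):
        super().__init__()
        # class_name_embeds: (num_classes, name_len, width), precomputed and frozen
        self.register_buffer("name_embeds", class_name_embeds)
        self.ctx = nn.Parameter(torch.randn(n_ctx, width) * 0.02)  # only trainable part

    def forward(self):
        n_cls = self.name_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.name_embeds], dim=1)  # (num_classes, n_ctx + name_len, width)

# Usage: only the prompt vectors receive gradients; both towers stay frozen.
text_encoder = FrozenTextEncoder()
prompts = PromptLearner(torch.randn(10, 4, 512))   # 10 classes, hypothetical name embeddings
image_feats = torch.randn(32, 512)                 # stand-in for frozen image-tower outputs
labels = torch.randint(0, 10, (32,))

optimizer = torch.optim.AdamW(prompts.parameters(), lr=2e-3)
text_feats = text_encoder(prompts())               # (10, 512)
logits = 100.0 * nn.functional.normalize(image_feats, dim=-1) \
         @ nn.functional.normalize(text_feats, dim=-1).T
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

The design choice worth noting is that the pretrained weights never change; adaptation cost is limited to a handful of context vectors, which is why prompt tuning is attractive for few-shot and resource-constrained settings.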
Papers
ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination
Xinxin Zhao, Wenzhe Cai, Likun Tang, Teng Wang
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset
Haoming Lu, Feifei Zhong
CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification
Qianru Han, Xinwei He, Zhi Liu, Sannyuya Liu, Ying Zhang, Jinhai Xiang
Debiasing Vision-Language Models with Text-Only Training
Yunfan Yang, Chaoquan Jiang, Zhiyu Lin, Jinlin Xiao, Jiaming Zhang, Jitao Sang
Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos
Harsh Mahesheka, Zhixian Xie, Zhaoran Wang, Wanxin Jin
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba
Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation
Kun Ding, Qiang Yu, Haojian Zhang, Gaofeng Meng, Shiming Xiang
RoRA-VLM: Robust Retrieval-Augmented Vision Language Models
Jingyuan Qi, Zhiyang Xu, Rulin Shao, Yang Chen, Jin Di, Yu Cheng, Qifan Wang, Lifu Huang
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model
Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, Chen Feng
Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models
Reza Abbasi, Sernam Lim
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
Mengyuan Chen, Junyu Gao, Changsheng Xu
Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon
Q-VLM: Post-training Quantization for Large Vision-Language Models
Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
Unsupervised Data Validation Methods for Efficient Model Training
Yurii Paniv
HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter
Yumiao Zhao, Bo Jiang, Xiao Wang, Qin Xu, Jin Tang
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks
Hoin Jung, Taeuk Jang, Xiaoqian Wang
How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?
Seongyun Lee, Geewook Kim, Jiyeon Kim, Hyunji Lee, Hoyeon Chang, Sue Hyun Park, Minjoon Seo