Vision Language Model
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific downstream tasks, and sparse token optimization, which reduces computational overhead. These advances are significant because they enable VLMs to be applied to diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
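To make the prompt-tuning idea concrete, below is a minimal PyTorch sketch in the style of learnable-context methods such as CoOp: a small set of context vectors is prepended to frozen class-name embeddings and optimized against image features while both encoders stay frozen. This is an illustrative sketch, not any listed paper's implementation; the PromptedTextEncoder class, the plain linear layer standing in for a frozen CLIP text transformer, and the random tensors standing in for pretrained embeddings and image features are all assumptions made for self-containment.

# Minimal sketch of textual prompt tuning for a CLIP-style VLM.
# Hypothetical stand-in modules; in practice the frozen encoders and
# class-name embeddings would come from a pretrained model such as CLIP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedTextEncoder(nn.Module):
    """Prepends learnable context vectors to frozen class-name embeddings."""

    def __init__(self, class_embeddings, ctx_len=4, embed_dim=512):
        super().__init__()
        # class_embeddings: (num_classes, name_len, embed_dim), kept frozen.
        self.register_buffer("class_embeddings", class_embeddings)
        # Only these context vectors are optimized during prompt tuning.
        self.ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        # Stand-in for a frozen transformer text encoder (assumption).
        self.encoder = nn.Linear(embed_dim, embed_dim)
        for p in self.encoder.parameters():
            p.requires_grad_(False)

    def forward(self):
        n_cls = self.class_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        prompts = torch.cat([ctx, self.class_embeddings], dim=1)
        # Pool over tokens to obtain one text feature per class.
        return F.normalize(self.encoder(prompts).mean(dim=1), dim=-1)

# Usage: classify image features against the tuned class prompts.
num_classes, name_len, dim = 10, 3, 512
text_branch = PromptedTextEncoder(torch.randn(num_classes, name_len, dim))
optimizer = torch.optim.Adam([text_branch.ctx], lr=1e-3)

# Random stand-ins for frozen image-encoder outputs and labels (assumption).
image_features = F.normalize(torch.randn(8, dim), dim=-1)
labels = torch.randint(0, num_classes, (8,))

logits = image_features @ text_branch().t() / 0.07  # temperature-scaled similarity
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()

In a realistic setup only the few context vectors receive gradients, which is why prompt tuning can adapt a large pretrained VLM to a new task with a tiny fraction of the model's parameters.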
Papers
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model
Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
Muchao Ye, Weiyang Liu, Pan He
Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA
Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP
Naman Deep Singh, Francesco Croce, Matthias Hein
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni, Pooyan Fazli
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang
LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu
SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks
Kim-Celine Kahl, Selen Erkan, Jeremias Traub, Carsten T. Lüth, Klaus Maier-Hein, Lena Maier-Hein, Paul F. Jaeger
Interleaved-Modal Chain-of-Thought
Jun Gao, Yongqi Li, Ziqiang Cao, Wenjie Li
Effective Fine-Tuning of Vision-Language Models for Accurate Galaxy Morphology Analysis
Ruoqi Wang, Haitao Wang, Qiong Luo
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
Jeongho Ju, Daeyoung Kim, SunYoung Park, Youngjune Kim
SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo