Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead. These advances matter because they enable VLMs to be applied to diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
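To make the prompt-tuning idea concrete, here is a minimal sketch of CoOp-style soft prompt tuning for a CLIP-like model: a small set of continuous context vectors is learned and prepended to class-name embeddings while both encoders stay frozen. The `PromptLearner` class, the toy stand-in encoders, and all dimensions are illustrative assumptions, not code from any paper listed below.

```python
# Minimal sketch of CoOp-style prompt tuning for a CLIP-like VLM.
# The encoders are toy stand-ins; in practice they would be the frozen
# text/image towers of a pretrained model (e.g., loaded via open_clip).
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_CTX, N_CLS = 64, 4, 10  # toy sizes; real CLIP embeddings are larger

class PromptLearner(nn.Module):
    """Learns continuous context vectors that are prepended to each
    class-name embedding, replacing a hand-written prompt such as
    'a photo of a {class}'."""
    def __init__(self, n_ctx: int, dim: int, class_embeds: torch.Tensor):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # trainable
        self.register_buffer("class_embeds", class_embeds)       # frozen

    def forward(self) -> torch.Tensor:
        n_cls = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # (n_cls, n_ctx + 1, dim): learned context + class-name token
        return torch.cat([ctx, self.class_embeds.unsqueeze(1)], dim=1)

# Stand-ins for the frozen pretrained towers.
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=1).eval()
image_encoder = nn.Linear(3 * 32 * 32, DIM)
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad_(False)  # only the prompt vectors are trained

class_embeds = torch.randn(N_CLS, DIM)       # placeholder class-name embeddings
prompt = PromptLearner(N_CTX, DIM, class_embeds)
opt = torch.optim.Adam(prompt.parameters(), lr=2e-3)

images = torch.randn(8, 3 * 32 * 32)         # dummy image batch
labels = torch.randint(0, N_CLS, (8,))       # dummy labels

for step in range(5):
    txt = text_encoder(prompt()).mean(dim=1)  # (n_cls, dim), pooled prompts
    img = image_encoder(images)               # (batch, dim)
    txt = F.normalize(txt, dim=-1)
    img = F.normalize(img, dim=-1)
    logits = 100.0 * img @ txt.t()            # CLIP-style scaled cosine sim
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step {step}: loss {loss.item():.3f}")
```

The design point this sketch illustrates is that only the few hundred parameters in `ctx` receive gradients; everything else is frozen, which is what makes prompt tuning far cheaper than full fine-tuning of the VLM.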
Papers
CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs
Abhas Kumar, Kapil Pathak, Rajesh Kavuru, Prabhakar Srinivasan
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey
Chenyang Liu, Jiafan Zhang, Keyan Chen, Man Wang, Zhengxia Zou, Zhenwei Shi
SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection
Joongwon Chae, Zhenyu Wang, Peiwu Qin
Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks
Jinjin Cai, Kexin Meng, Baijian Yang, Gang Shao
BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding
Chenguang Huang, Shengchao Yan, Wolfram Burgard
Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation
Sepand Dyanatkar, Angran Li, Alexander Dungate
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model
Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
Muchao Ye, Weiyang Liu, Pan He
Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA
Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP
Naman Deep Singh, Francesco Croce, Matthias Hein
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni, Pooyan Fazli
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang
LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu
SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks
Kim-Celine Kahl, Selen Erkan, Jeremias Traub, Carsten T. Lüth, Klaus Maier-Hein, Lena Maier-Hein, Paul F. Jaeger
Interleaved-Modal Chain-of-Thought
Jun Gao, Yongqi Li, Ziqiang Cao, Wenjie Li
Effective Fine-Tuning of Vision-Language Models for Accurate Galaxy Morphology Analysis
Ruoqi Wang, Haitao Wang, Qiong Luo