Vision Language Model
Vision-language models (VLMs) integrate visual and textual information to perform complex multimodal tasks, bridging computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes learnable textual or visual prompts for specific downstream tasks, and sparse token optimization, which reduces the number of visual tokens processed to cut computational overhead. These advances matter because they let VLMs be deployed in diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing persistent challenges such as hallucination and model miscalibration.
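To make the prompt-tuning idea concrete, the sketch below shows CoOp-style context optimization in PyTorch: the VLM's encoders stay frozen and only a small set of learnable context embeddings prepended to the class-name embeddings is trained. It is a minimal illustration under assumed dimensions; the tiny linear encoders, the PromptTuner class, and all hyperparameters are stand-ins rather than any listed paper's implementation.

```python
# Minimal sketch of CoOp-style textual prompt tuning (illustrative, not a reference implementation).
# A real setup would use CLIP's transformer towers; here small linear layers act as frozen stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, CTX_LEN, NUM_CLASSES = 512, 4, 10  # assumed toy sizes

class PromptTuner(nn.Module):
    def __init__(self, text_encoder: nn.Module, class_name_embeds: torch.Tensor):
        super().__init__()
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():   # keep the text encoder frozen
            p.requires_grad_(False)
        # Learnable context tokens shared across classes: the only trainable parameters.
        self.ctx = nn.Parameter(torch.randn(CTX_LEN, EMBED_DIM) * 0.02)
        self.register_buffer("class_embeds", class_name_embeds)  # [C, L, D] fixed class-name tokens

    def forward(self) -> torch.Tensor:
        C = self.class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)            # [C, CTX_LEN, D]
        prompts = torch.cat([ctx, self.class_embeds], dim=1)     # prepend context to class names
        # Pool the prompt tokens and encode; return L2-normalized class text features [C, D].
        return F.normalize(self.text_encoder(prompts.mean(dim=1)), dim=-1)

# Frozen stand-in encoders and toy data (hypothetical; replace with a real CLIP backbone and dataset).
text_encoder = nn.Linear(EMBED_DIM, EMBED_DIM)
image_encoder = nn.Linear(EMBED_DIM, EMBED_DIM)
for p in image_encoder.parameters():
    p.requires_grad_(False)

class_name_embeds = torch.randn(NUM_CLASSES, 8, EMBED_DIM)
tuner = PromptTuner(text_encoder, class_name_embeds)
optimizer = torch.optim.AdamW([tuner.ctx], lr=2e-3)             # optimize only the context tokens

images = torch.randn(16, EMBED_DIM)                             # toy image features
labels = torch.randint(0, NUM_CLASSES, (16,))

for _ in range(10):
    image_feats = F.normalize(image_encoder(images), dim=-1)    # [B, D]
    text_feats = tuner()                                        # [C, D]
    logits = 100.0 * image_feats @ text_feats.t()               # scaled cosine-similarity logits
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The design point this illustrates is that prompt tuning adapts a frozen VLM to a new task by updating only a few thousand parameters, which is why it is attractive for efficiency-focused work such as several of the papers listed below.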
Papers
VisionArena: 230K Real World User-VLM Conversations with Preference Labels
Christopher Chou, Lisa Dunlap, Koki Mashita, Krishna Mandal, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez, Wei-Lin Chiang
POINTS1.5: Building a Vision-Language Model towards Real World Applications
Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou
LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba
Yubo Cui, Zhiheng Li, Jiaqiang Wang, Zheng Fang
HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models
Shiding Zhu, Wenhui Dong, Jun Song, Yanan Guo, Bo Zheng
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, Thao Minh Le
Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models
Sri Harsha Dumpala, David Arps, Sageev Oore, Laura Kallmeyer, Hassan Sajjad
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses
Jiayun Luo, Mir Rayat Imtiaz Hossain, Boyang Li, Leonid Sigal
AmCLR: Unified Augmented Learning for Cross-Modal Representations
Ajay Jagannath, Aayush Upadhyay, Anant Mehta
RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models
Greg Heinrich, Mike Ranzinger, Hongxu (Danny) Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, Pavlo Molchanov
Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios
Jiaqi Fan, Jianhua Wu, Hongqing Chu, Quanbo Ge, Bingzhao Gao
MM-PoE: Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models
Sayak Chakrabarty, Souradip Pal
Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling
Donggeun Kim, Yujin Jo, Myungjoo Lee, Taesup Kim
Visual Lexicon: Rich Image Features in Language Space
XuDong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid
Ranking-aware adapter for text-driven image ordering with CLIP
Wei-Hsiang Yu, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai
The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
Alessandro Serra, Francesco Ortu, Emanuele Panizon, Lucrezia Valeriani, Lorenzo Basile, Alessio Ansuini, Diego Doimo, Alberto Cazzaniga
DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction
Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization
Kangyu Zhu, Peng Xia, Yun Li, Hongtu Zhu, Sheng Wang, Huaxiu Yao
Post-hoc Probabilistic Vision-Language Models
Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp
LVP-CLIP: Revisiting CLIP for Continual Learning with Label Vector Pool
Yue Ma, Huantao Ren, Boyu Wang, Jingang Jin, Senem Velipasalar, Qinru Qiu