Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which learns task-specific textual or visual prompts while keeping the pretrained model frozen, and sparse token optimization, which prunes redundant visual tokens to reduce computational overhead. These advances matter because they let VLMs be deployed in diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while mitigating challenges such as hallucination and model miscalibration.
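As a concrete illustration of prompt tuning, the minimal sketch below trains CoOp-style learnable context vectors against a frozen encoder. It is a toy example under stated assumptions, not any listed paper's implementation: FrozenTextEncoder, EMBED_DIM, CTX_LEN, and the random "image features" are hypothetical stand-ins for a pretrained CLIP-like model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions standing in for a CLIP-like model.
EMBED_DIM, CTX_LEN, N_CLASSES = 512, 8, 10

class FrozenTextEncoder(nn.Module):
    """Placeholder for a pretrained text encoder; its weights stay frozen."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, EMBED_DIM)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, ctx):  # ctx: (n_classes, ctx_len, dim)
        # Pool the learnable context tokens into one embedding per class.
        return self.proj(ctx.mean(dim=1))

class PromptLearner(nn.Module):
    """CoOp-style prompt tuning: only these context vectors are trained."""
    def __init__(self):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(N_CLASSES, CTX_LEN, EMBED_DIM) * 0.02)

def classify(image_feats, text_feats, temperature=0.07):
    # Cosine-similarity logits between image and class-prompt embeddings.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    return img @ txt.t() / temperature

encoder, prompts = FrozenTextEncoder(), PromptLearner()
# Only the prompt parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(prompts.parameters(), lr=2e-3)

# One toy training step on random "image features" and labels.
image_feats = torch.randn(4, EMBED_DIM)
labels = torch.randint(0, N_CLASSES, (4,))
logits = classify(image_feats, encoder(prompts.ctx))
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

The key design choice is that only prompts.parameters() reach the optimizer, so the pretrained encoder is never updated; this is what makes prompt tuning cheap enough to adapt a large VLM per task.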
Papers
SCoTT: Wireless-Aware Path Planning with Vision Language Models and Strategic Chains-of-Thought
Aladin Djuhera, Vlad C. Andrei, Amin Seffo, Holger Boche, Walid Saad
From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
Zizhao Li, Zhengkang Xiang, Joseph West, Kourosh Khoshelham
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang, Junxian Li, Jingdi Lei, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou
DistinctAD: Distinctive Audio Description Generation in Contexts
Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan
DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models
Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Yu Wang
Aligning Knowledge Concepts to Whole Slide Images for Precise Histopathology Image Analysis
Weiqin Zhao, Ziyu Guo, Yinshuang Fan, Yuming Jiang, Maximus Yeung, Lequan Yu
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
Donggoo Kang, Dasol Jeong, Hyunmin Lee, Sangwoo Park, Hasil Park, Sunkyu Kwon, Yeongjoon Kim, Joonki Paik
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao, Bryan Hooi, Jun Liu, Kai-Wei Chang, Zi Huang, Yujun Cai
Verbalized Representation Learning for Interpretable Few-Shot Generalization
Cheng-Fu Yang, Da Yin, Wenbo Hu, Nanyun Peng, Bolei Zhou, Kai-Wei Chang
What's in the Image? A Deep-Dive into the Vision of Vision Language Models
Omri Kaduri, Shai Bagon, Tali Dekel
CoA: Chain-of-Action for Generative Semantic Labels
Meng Wei, Zhongnian Li, Peng Ying, Xinzheng Xu
Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation
Chanyoung Kim, Dayun Ju, Woojung Han, Ming-Hsuan Yang, Seong Jae Hwang
Probing the limitations of multimodal language models for chemistry and materials research
Nawaf Alampara, Mara Schilling-Wilhelmi, Martiño Ríos-García, Indrajeet Mandal, Pranav Khetarpal, Hargun Singh Grover, N. M. Anoop Krishnan, Kevin Maik Jablonka
A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models
Manuel Schwonberg, Claus Werner, Hanno Gottschalk, Carsten Meyer
Open-Vocabulary Octree-Graph for 3D Scene Understanding
Zhigang Wang, Yifei Su, Chenhui Li, Dong Wang, Yan Huang, Bin Zhao, Xuelong Li
Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models
Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah
Improving Medical Diagnostics with Vision-Language Models: Convex Hull-Based Uncertainty Analysis
Ferhat Ozgur Catak, Murat Kuzlu, Taylor Patrick
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan
Test-time Alignment-Enhanced Adapter for Vision-Language Models
Baoshun Tong, Kaiyu Song, Hanjiang Lai
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen