Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead. These advances matter because they let VLMs be applied to diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucinations and model miscalibration.
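To make the prompt-tuning idea concrete, below is a minimal CoOp-style sketch (Zhou et al., "Learning to Prompt for Vision-Language Models"): a small set of learnable context vectors is prepended to frozen class-name token embeddings, and only those vectors are trained against a frozen backbone. Everything here is illustrative: DummyTextEncoder is a hypothetical stand-in for a frozen CLIP text encoder, and the dimensions, class counts, and image features are placeholder assumptions, not any specific paper's setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512   # token embedding width (assumption; matches CLIP ViT-B/32)
N_CTX = 4       # number of learnable context tokens

class DummyTextEncoder(nn.Module):
    """Hypothetical stand-in for a frozen CLIP-like text transformer."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(EMB_DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tok_emb):                    # (n_cls, seq_len, dim)
        return self.encoder(tok_emb).mean(dim=1)   # pooled text feature

class PromptLearner(nn.Module):
    """Learnable context vectors shared across classes: '[V]_1 ... [V]_M {class}'."""
    def __init__(self, class_emb):
        super().__init__()
        # class_emb: (n_cls, n_name_tokens, dim) frozen class-name embeddings
        self.register_buffer("class_emb", class_emb)
        self.ctx = nn.Parameter(torch.randn(N_CTX, EMB_DIM) * 0.02)

    def forward(self):
        n_cls = self.class_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.class_emb], dim=1)

text_encoder = DummyTextEncoder().eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)                 # backbone stays frozen

class_emb = torch.randn(10, 3, EMB_DIM)     # 10 classes, 3 name tokens each (toy data)
prompts = PromptLearner(class_emb)
opt = torch.optim.AdamW(prompts.parameters(), lr=2e-3)

image_feats = F.normalize(torch.randn(32, EMB_DIM), dim=-1)  # stand-in for frozen image-encoder output
labels = torch.randint(0, 10, (32,))

for _ in range(5):                           # a few tuning steps
    text_feats = F.normalize(text_encoder(prompts()), dim=-1)
    logits = 100.0 * image_feats @ text_feats.t()   # CLIP-style scaled cosine logits
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

The design point is that gradients flow through the frozen encoder into the context vectors, so adaptation costs only N_CTX x EMB_DIM parameters; several of the papers listed below (e.g., the test-time prompt tuning and context optimization entries) build on variants of this idea.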
Papers
CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments
Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Mohamed Elnoor, Anuj Zore, Brian Ichter, Fei Xia, Jie Tan, Wenhao Yu, Dinesh Manocha
Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning
Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Sanghyun Seo
Few-Shot Adversarial Prompt Learning on Vision-Language Models
Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, Tongliang Liu
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng, Runyu Ding, Jihan Yang, Jiahui Liu, Yijiang Li, Xiaojuan Qi, Edith Ngai
MyVLM: Personalizing VLMs for User-Specific Queries
Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
Exosense: A Vision-Centric Scene Understanding System For Safe Exoskeleton Navigation
Jianeng Wang, Matias Mattamala, Christina Kassab, Lintong Zhang, Maurice Fallon
C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion
Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, Chang D. Yoo
Bridge the Modality and Capability Gaps in Vision-Language Model Selection
Chao Yi, Yu-Hang He, De-Chuan Zhan, Han-Jia Ye
CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models
Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Jingkun An, Yinghao Zhu, Zongjian Li, Haoran Feng, Bohua Chen, Yemin Shi, Chengwei Pan
Negative Yields Positive: Unified Dual-Path Adapter for Vision-Language Models
Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie
Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
Elaine Sui, Xiaohan Wang, Serena Yeung-Levy
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma
FlexCap: Generating Rich, Localized, and Flexible Captions in Images
Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, Andrew Markham
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger
Compositional Kronecker Context Optimization for Vision-Language Models
Kun Ding, Xiaohui Li, Qiang Yu, Ying Wang, Haojian Zhang, Shiming Xiang
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, You He
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li