Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead by pruning redundant visual tokens. These advances are significant because they enable VLMs to be deployed in diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
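As a concrete illustration of the prompt-tuning idea mentioned above, the sketch below trains a small set of learnable "soft prompt" context vectors against a frozen CLIP-style image/text encoder pair. This is a minimal sketch under stated assumptions, not any listed paper's implementation: the encoders are toy stand-ins for frozen pretrained models, and the dimensions, class count, temperature, and all names are illustrative.

```python
# Minimal sketch of soft prompt tuning for a CLIP-style VLM.
# The encoders are toy stand-ins for frozen pretrained image/text encoders;
# only the shared context vectors (the "soft prompt") are optimized.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 64      # joint embedding size (assumption)
CTX_LEN = 4         # number of learnable context tokens
NUM_CLASSES = 3     # toy label set

class FrozenTextEncoder(nn.Module):
    """Stand-in for a frozen text encoder: pools token embeddings into one vector."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, EMBED_DIM)
        for p in self.parameters():
            p.requires_grad_(False)  # kept frozen, as in prompt tuning

    def forward(self, token_embeds):                 # (num_classes, seq_len, dim)
        return self.proj(token_embeds.mean(dim=1))   # (num_classes, dim)

class FrozenImageEncoder(nn.Module):
    """Stand-in for a frozen image encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, EMBED_DIM)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, images):                 # (batch, 3, 32, 32)
        return self.proj(images.flatten(1))    # (batch, dim)

class PromptTunedClassifier(nn.Module):
    """Prepends shared learnable context tokens to fixed class-name embeddings."""
    def __init__(self):
        super().__init__()
        self.image_encoder = FrozenImageEncoder()
        self.text_encoder = FrozenTextEncoder()
        # Learnable soft-prompt context shared across classes.
        self.ctx = nn.Parameter(torch.randn(CTX_LEN, EMBED_DIM) * 0.02)
        # Fixed stand-ins for tokenized class-name embeddings.
        self.register_buffer("class_tokens", torch.randn(NUM_CLASSES, 1, EMBED_DIM))

    def forward(self, images):
        ctx = self.ctx.unsqueeze(0).expand(NUM_CLASSES, -1, -1)
        prompts = torch.cat([ctx, self.class_tokens], dim=1)
        text_feats = F.normalize(self.text_encoder(prompts), dim=-1)
        image_feats = F.normalize(self.image_encoder(images), dim=-1)
        return image_feats @ text_feats.t() / 0.07   # cosine-similarity logits

if __name__ == "__main__":
    model = PromptTunedClassifier()
    optim = torch.optim.Adam([model.ctx], lr=1e-3)   # only the prompt is trained
    images = torch.randn(8, 3, 32, 32)
    labels = torch.randint(0, NUM_CLASSES, (8,))
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optim.step()
    print(f"loss: {loss.item():.3f}")
```

Because the encoders stay frozen and only a handful of context vectors receive gradients, this style of adaptation is cheap relative to full fine-tuning, which is what makes prompt tuning attractive for task-specific VLM adaptation.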
Papers
Vision Language Models See What You Want but not What You See
Qingying Gao, Yijiang Li, Haiyun Lyu, Haoran Sun, Dezhi Luo, Hokin Deng
Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models
Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy
VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
Xuefeng Du, Reshmi Ghosh, Robert Sim, Ahmed Salem, Vitor Carvalho, Emily Lawton, Yixuan Li, Jack W. Stokes
Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models
Qi Wu, Zipeng Fu, Xuxin Cheng, Xiaolong Wang, Chelsea Finn
Do Vision-Language Models Really Understand Visual Language?
Buse Giledereli, Yifan Hou, Yilei Tu, Mrinmaya Sachan
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
Qiaojun Yu, Siyuan Huang, Xibin Yuan, Zhengkai Jiang, Ce Hao, Xin Li, Haonan Chang, Junbo Wang, Liu Liu, Hongsheng Li, Peng Gao, Cewu Lu
Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments
Mohamed Elnoor, Kasun Weerakoon, Gershom Seneviratne, Ruiqi Xian, Tianrui Guan, Mohamed Khalid M Jaffar, Vignesh Rajagopal, Dinesh Manocha
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering
Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao
Resolving Positional Ambiguity in Dialogues by Vision-Language Models for Robot Navigation
Kuan-Lin Chen, Tzu-Ti Wei, Li-Tzu Yeh, Elaine Kao, Yu-Chee Tseng, Jen-Jee Chen
Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function
Chenyi Zhuang, Ying Hu, Pan Gao
FairPIVARA: Reducing and Assessing Biases in CLIP-Based Multimodal Models
Diego A. B. Moreira, Alef Iury Ferreira, Gabriel Oliveira dos Santos, Luiz Pereira, João Medrado Gondim, Gustavo Bonil, Helena Maia, Nádia da Silva, Simone Tiemi Hashiguti, Jefersson A. dos Santos, Helio Pedrini, Sandra Avila
DOTA: Distributional Test-Time Adaptation of Vision-Language Models
Zongbo Han, Jialong Yang, Junfan Li, Qinghua Hu, Qianli Xu, Mike Zheng Shou, Changqing Zhang
TrojVLM: Backdoor Attack Against Vision Language Models
Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, Chao Chen
SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation
Xin Li, Siyuan Huang, Qiaojun Yu, Zhengkai Jiang, Ce Hao, Yimeng Zhu, Hongsheng Li, Peng Gao, Cewu Lu
DARE: Diverse Visual Question Answering with Robustness Evaluation
Hannah Sterz, Jonas Pfeiffer, Ivan Vulić
The Hard Positive Truth about Vision-Language Compositionality
Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, Ranjay Krishna