Vision Language Model
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead. These advances matter because they enable VLMs to be applied to diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
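To make the prompt-tuning idea concrete, below is a minimal CoOp-style sketch in PyTorch: a small set of learnable context vectors is prepended to fixed class-name embeddings, and only those vectors are trained while the VLM's encoders stay frozen. This is an illustrative sketch, not any paper's implementation; the `text_encoder` here is a stand-in lambda (a real system would use a frozen CLIP-style text transformer), and names such as `PromptLearner` and `prompt_tuning_step` are hypothetical.

```python
# Minimal prompt-tuning sketch (CoOp-style), assuming a frozen text encoder
# and precomputed image features; all module and function names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptLearner(nn.Module):
    """Learnable context vectors prepended to fixed class-name token embeddings."""

    def __init__(self, n_ctx: int, dim: int, class_embeddings: torch.Tensor):
        super().__init__()
        # Only these context vectors receive gradients; the rest of the VLM is frozen.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.register_buffer("class_embeddings", class_embeddings)  # (n_cls, n_tok, dim)

    def forward(self) -> torch.Tensor:
        n_cls = self.class_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)       # (n_cls, n_ctx, dim)
        return torch.cat([ctx, self.class_embeddings], dim=1)   # (n_cls, n_ctx + n_tok, dim)


def prompt_tuning_step(prompt_learner, text_encoder, image_features, labels,
                       optimizer, temperature: float = 0.01) -> float:
    """One training step: image-text similarity logits, cross-entropy on labels."""
    prompts = prompt_learner()                        # (n_cls, seq, dim)
    text_features = text_encoder(prompts)             # (n_cls, dim), encoder stays frozen
    text_features = F.normalize(text_features, dim=-1)
    image_features = F.normalize(image_features, dim=-1)
    logits = image_features @ text_features.t() / temperature
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    dim, n_ctx, n_cls, n_tok, batch = 64, 4, 5, 3, 8
    # Stand-in frozen encoder: mean-pools token embeddings.
    # A real VLM text encoder (e.g., CLIP's transformer) would replace this.
    text_encoder = lambda prompts: prompts.mean(dim=1)
    class_embeddings = torch.randn(n_cls, n_tok, dim)  # fixed class-name embeddings
    image_features = torch.randn(batch, dim)           # from a frozen image encoder
    labels = torch.randint(0, n_cls, (batch,))

    learner = PromptLearner(n_ctx, dim, class_embeddings)
    opt = torch.optim.SGD(learner.parameters(), lr=0.01)
    for _ in range(3):
        print(prompt_tuning_step(learner, text_encoder, image_features, labels, opt))
```

Because only the context vectors are optimized, the adaptation cost is a few thousand parameters rather than the full model, which is what makes prompt tuning attractive for the task-specific settings listed in the papers below.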
Papers
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities
Ying Su, Zhan Ling, Haochen Shi, Jiayang Cheng, Yauwai Yim, Yangqiu Song
The Wallpaper is Ugly: Indoor Localization using Vision and Language
Seth Pate, Lawson L.S. Wong
An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
Ahmed Abdulaal, Hugo Fry, Nina Montaña-Brown, Ayodeji Ijishakin, Jack Gao, Stephanie Hyland, Daniel C. Alexander, Daniel C. Castro
Generalizable Prompt Tuning for Vision-Language Models
Qian Zhang
CLIP-Clique: Graph-based Correspondence Matching Augmented by Vision Language Models for Object-based Global Localization
Shigemichi Matsuzaki, Kazuhito Tanaka, Kazuhiro Shintani
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Nick Jiang, Anish Kachinthaya, Suzie Petryk, Yossi Gandelsman
Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models
Shuoyuan Wang, Yixuan Li, Hongxin Wei
LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model
Duy M. H. Nguyen, Nghiem T. Diep, Trung Q. Nguyen, Hoang-Bao Le, Tai Nguyen, Tien Nguyen, TrungTin Nguyen, Nhat Ho, Pengtao Xie, Roger Wattenhofer, James Zhou, Daniel Sonntag, Mathias Niepert
Guiding Long-Horizon Task and Motion Planning with Vision Language Models
Zhutian Yang, Caelan Garrett, Dieter Fox, Tomás Lozano-Pérez, Leslie Pack Kaelbling
ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning
Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time Tracker
Xinlong Hou, Sen Shen, Xueshen Li, Xinran Gao, Ziyi Huang, Steven J. Holiday, Matthew R. Cribbet, Susan W. White, Edward Sazonov, Yu Gan
Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities
Kenza Amara, Lukas Klein, Carsten Lüth, Paul Jäger, Hendrik Strobelt, Mennatallah El-Assady
Toward a Holistic Evaluation of Robustness in CLIP Models
Weijie Tu, Weijian Deng, Tom Gedeon
Information-Theoretical Principled Trade-off between Jailbreakability and Stealthiness on Vision Language Models
Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, Chu-Song Chen
Backdooring Vision-Language Models with Out-Of-Distribution Data
Weimin Lyu, Jiachen Yao, Saumya Gupta, Lu Pang, Tao Sun, Lingjie Yi, Lijie Hu, Haibin Ling, Chao Chen
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark
Hasnat Md Abdullah, Tian Liu, Kangda Wei, Shu Kong, Ruihong Huang
ScVLM: a Vision-Language Model for Driving Safety Critical Event Understanding
Liang Shi, Boyu Jiang, Feng Guo
Find Everything: A General Vision Language Model Approach to Multi-Object Search
Daniel Choi, Angus Fung, Haitong Wang, Aaron Hao Tan
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
Vision Language Models Know Law of Conservation without Understanding More-or-Less
Dezhi Luo, Haiyun Lyu, Qingying Gao, Haoran Sun, Yijiang Li, Hokin Deng