Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for a specific task, and sparse token optimization, which reduces computational overhead. These advances are significant because they enable VLMs to be applied in diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
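The prompt-tuning line of work mentioned above (e.g., CoOp-style methods) keeps the VLM's encoders frozen and learns only a few "soft prompt" context vectors that are prepended to each class name. Below is a minimal PyTorch sketch of that idea, not any specific paper's implementation: the encoders are random stand-ins so the snippet runs without CLIP weights, and names such as `PromptLearner`, the dimensions, and the mean-pooling step are illustrative assumptions (real CLIP pools at the end-of-text token).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_CTX, N_CLASSES = 512, 4, 10  # joint-embedding dim, learnable context tokens, classes

class PromptLearner(nn.Module):
    """Learns shared context vectors prepended to frozen class embeddings."""
    def __init__(self):
        super().__init__()
        # The only trainable parameters: a shared soft prompt of N_CTX vectors.
        self.ctx = nn.Parameter(0.02 * torch.randn(N_CTX, DIM))
        # Frozen class-name token embeddings (stand-in for CLIP's tokenizer + embedding).
        self.register_buffer("cls_tokens", torch.randn(N_CLASSES, 1, DIM))

    def forward(self):
        # Prepend the shared context to every class: [N_CLASSES, N_CTX + 1, DIM].
        ctx = self.ctx.unsqueeze(0).expand(N_CLASSES, -1, -1)
        return torch.cat([ctx, self.cls_tokens], dim=1)

# Frozen stand-in encoders; real code would use CLIP's transformer towers.
text_encoder = nn.Linear(DIM, DIM).requires_grad_(False)
image_encoder = nn.Linear(768, DIM).requires_grad_(False)  # maps image features to the joint space

prompt_learner = PromptLearner()
optimizer = torch.optim.SGD(prompt_learner.parameters(), lr=2e-3)

def classify(image_features):
    img = F.normalize(image_encoder(image_features), dim=-1)
    # Mean-pool the prompt tokens, encode, normalize (CLIP instead pools at the EOT token).
    txt = F.normalize(text_encoder(prompt_learner().mean(dim=1)), dim=-1)
    return 100.0 * img @ txt.t()  # cosine-similarity logits, CLIP-style scaling

# One toy training step: gradients reach only the soft-prompt context vectors.
images = torch.randn(8, 768)               # pretend precomputed image features
labels = torch.randint(0, N_CLASSES, (8,))
optimizer.zero_grad()
loss = F.cross_entropy(classify(images), labels)
loss.backward()
optimizer.step()
```

The key design point is that the optimizer only sees `prompt_learner.parameters()`: a handful of context vectors are trained per task while the backbone stays untouched, which is what makes prompt tuning cheap compared with full fine-tuning.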
Papers
VLM-Vac: Enhancing Smart Vacuums through VLM Knowledge Distillation and Language-Guided Experience Replay
Reihaneh Mirjalili, Michael Krawez, Florian Walter, Wolfram Burgard
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data
Grace Tang, Swetha Rajkumar, Yifei Zhou, Homer Rich Walke, Sergey Levine, Kuan Fang
OLiVia-Nav: An Online Lifelong Vision Language Approach for Mobile Robot Social Navigation
Siddarth Narasimhan, Aaron Hao Tan, Daniel Choi, Goldie Nejat
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs
Bowen Yan, Zhengsong Zhang, Liqiang Jing, Eftekhar Hossain, Xinya Du
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models
Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, Niloy Ganguly
Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case
Peng Chen, Pi Bu, Jun Song, Yuan Gao, Bo Zheng
Vision Language Models Can Parse Floor Plan Maps
David DeFazio, Hrudayangam Mehta, Jeremy Blackburn, Shiqi Zhang
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
Kosuke Sakurai, Tatsuya Ishii, Ryotaro Shimizu, Linxin Song, Masayuki Goto
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
Mixture of Prompt Learning for Vision Language Models
Yu Du, Tong Niu, Rong Zhao
GauTOAO: Gaussian-based Task-Oriented Affordance of Objects
Jiawen Wang, Dingsheng Luo
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension
Amaia Cardiel, Eloi Zablocki, Elias Ramzi, Oriane Siméoni, Matthieu Cord
Constructive Apraxia: An Unexpected Limit of Instructible Vision-Language Models and Analog for Human Cognitive Disorders
David Noever, Samantha E. Miller Noever
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA
Jian Lan, Diego Frassinelli, Barbara Plank
CAST: Cross-modal Alignment Similarity Test for Vision Language Models
Gautier Dagan, Olga Loginova, Anil Batra
KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph
Yanbei Jiang, Krista A. Ehinger, Jey Han Lau