Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead by pruning redundant visual tokens. These advances matter because they let VLMs serve diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing persistent challenges such as hallucination and model miscalibration.
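To make the prompt-tuning idea concrete, here is a minimal sketch of soft prompt tuning in the CoOp style: a small set of learnable context vectors is prepended to a frozen class-name embedding, and only those vectors are optimized to better match an image feature. All embeddings, the toy mean-pool "encoder", and the optimization loop below are illustrative stand-ins, not any specific model's API.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_ctx = 8, 4

# Stand-ins for a real VLM: a frozen image feature and a frozen
# class-name embedding (both would come from pretrained encoders).
image_feat = rng.normal(size=dim)
image_feat /= np.linalg.norm(image_feat)
class_embed = rng.normal(size=dim)

# The only trainable parameters: learnable context ("soft prompt") vectors.
ctx = rng.normal(size=(n_ctx, dim)) * 0.1

def text_feat(ctx):
    # Toy "text encoder": mean-pool context vectors with the class
    # embedding, then L2-normalize. A real VLM would run a frozen
    # transformer here; prompt tuning never updates its weights.
    f = np.vstack([ctx, class_embed]).mean(axis=0)
    return f / np.linalg.norm(f)

def score(ctx):
    # Cosine similarity between image and prompted text features.
    return float(image_feat @ text_feat(ctx))

before = score(ctx)
lr, eps = 0.2, 1e-4
for _ in range(300):
    # Numerical gradient ascent on similarity w.r.t. ctx only.
    g = np.zeros_like(ctx)
    for i in range(n_ctx):
        for j in range(dim):
            ctx[i, j] += eps
            up = score(ctx)
            ctx[i, j] -= 2 * eps
            down = score(ctx)
            ctx[i, j] += eps
            g[i, j] = (up - down) / (2 * eps)
    ctx += lr * g
after = score(ctx)
print(f"similarity before tuning: {before:.3f}, after: {after:.3f}")
```

The key property the sketch demonstrates is that adaptation happens entirely in the prompt space: the encoders (here, the pooling function and the class embedding) stay frozen, which is what makes prompt tuning cheap relative to full fine-tuning.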
Papers
One missing piece in Vision and Language: A Survey on Comics Understanding
Emanuele Vivoli, Andrey Barsky, Mohamed Ali Souibgui, Artemis LLabres, Marco Bertini, Dimosthenis Karatzas
Visuo-Tactile Zero-Shot Object Recognition with Vision-Language Model
Shiori Ueda, Atsushi Hashimoto, Masashi Hamaya, Kazutoshi Tanaka, Hideo Saito
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha, Vinija Jain, Aman Chadha
Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations
Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain, Ahmed Imteaj
MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving
Enming Zhang, Xingyuan Dai, Yisheng Lv, Qinghai Miao
Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations
Keumgang Cha, Donggeun Yu, Junghoon Seo