Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex multimodal tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead. These advances enable VLMs to be applied to diverse real-world domains, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
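Prompt tuning, as mentioned above, typically keeps the pretrained VLM frozen and learns a small set of continuous "soft prompt" vectors that are prepended to class-name embeddings before they pass through the text encoder. The minimal sketch below illustrates this idea in the style of CoOp-like methods; `PromptLearner`, the stand-in `text_encoder`, and the random placeholder image features are illustrative assumptions substituting for a frozen pretrained model such as CLIP, not an implementation from any of the listed papers.

```python
# Minimal sketch of prompt tuning for a CLIP-style VLM (CoOp-style learnable context).
# Assumptions: `text_encoder`, the class-name embeddings, and the image features are
# placeholders for a frozen pretrained VLM; only the soft prompt `ctx` is trained.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptLearner(nn.Module):
    def __init__(self, num_classes: int, ctx_len: int = 4, dim: int = 512):
        super().__init__()
        # Learnable context vectors shared across classes (the "a photo of a" analogue).
        self.ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        # Frozen class-name embeddings (random placeholders, one token per class).
        self.register_buffer("cls_emb", torch.randn(num_classes, 1, dim))

    def forward(self) -> torch.Tensor:
        # Prepend the shared context to each class-name embedding: (C, ctx_len + 1, dim).
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)


def text_encoder(prompt_tokens: torch.Tensor) -> torch.Tensor:
    # Stand-in for a frozen text encoder: mean-pool tokens and L2-normalize.
    return F.normalize(prompt_tokens.mean(dim=1), dim=-1)


def train_step(prompts: PromptLearner, img_feats: torch.Tensor,
               labels: torch.Tensor, opt: torch.optim.Optimizer) -> float:
    txt_feats = text_encoder(prompts())              # (C, dim) class prototypes
    logits = 100.0 * img_feats @ txt_feats.t()       # temperature-scaled cosine similarities
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    C, D = 10, 512
    prompts = PromptLearner(num_classes=C, dim=D)
    opt = torch.optim.AdamW(prompts.parameters(), lr=2e-3)
    img = F.normalize(torch.randn(32, D), dim=-1)    # placeholder image features
    lbl = torch.randint(0, C, (32,))
    for _ in range(5):
        loss = train_step(prompts, img, lbl, opt)
    print(f"final loss: {loss:.3f}")
```

Only the context vectors receive gradients in this setup, which is what keeps prompt tuning lightweight compared with full fine-tuning of the VLM.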
Papers
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna
Finetuning CLIP to Reason about Pairwise Differences
Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training
Yiyi Tao, Zhuoyue Wang, Hang Zhang, Lun Wang
TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen
One missing piece in Vision and Language: A Survey on Comics Understanding
Emanuele Vivoli, Mohamed Ali Souibgui, Andrey Barsky, Artemis Llabrés, Marco Bertini, Dimosthenis Karatzas
Visuo-Tactile Zero-Shot Object Recognition with Vision-Language Model
Shiori Ueda, Atsushi Hashimoto, Masashi Hamaya, Kazutoshi Tanaka, Hideo Saito
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha, Vinija Jain, Aman Chadha
Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations
Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain, Ahmed Imteaj
MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving
Enming Zhang, Xingyuan Dai, Yisheng Lv, Qinghai Miao
Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations
Keumgang Cha, Donggeun Yu, Junghoon Seo