Vision Language Model
Vision-language models (VLMs) integrate visual and textual information to perform tasks that require both, bridging computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead. These advances matter because they allow VLMs to be deployed in diverse real-world domains, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucinations and model miscalibration.
Papers
Exploiting LMM-based knowledge for image classification tasks
Maria Tzelepi, Vasileios Mezaris
Balancing Performance and Efficiency in Zero-shot Robotic Navigation
Dmytro Kuzmenko, Nadiya Shvai
Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models
Jinhao Li, Haopeng Li, Sarah Erfani, Lei Feng, James Bailey, Feng Liu
Boosting Vision-Language Models with Transduction
Maxime Zanella, Benoît Gérin, Ismail Ben Ayed
ED-SAM: An Efficient Diffusion Sampling Approach to Domain Generalization in Vision-Language Foundation Models
Thanh-Dat Truong, Xin Li, Bhiksha Raj, Jackson Cothren, Khoa Luu
ATTIQA: Generalizable Image Quality Feature Extractor using Attribute-aware Pretraining
Daekyu Kwon, Dongyoung Kim, Sehwan Ki, Younghyun Jo, Hyong-Euk Lee, Seon Joo Kim
MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4
Vahid Azizi, Fatemeh Koochaki
StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond
Pengyuan Lyu, Yulin Li, Hao Zhou, Weihong Ma, Xingyu Wan, Qunyi Xie, Liang Wu, Chengquan Zhang, Kun Yao, Errui Ding, Jingdong Wang
InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding
Huaxiang Zhang, Yaojia Mu, Guo-Niu Zhu, Zhongxue Gan
Information Theoretic Text-to-Image Alignment
Chao Wang, Giulio Franzese, Alessandro Finamore, Massimo Gallo, Pietro Michiardi
Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images
Mansi Kakkar, Dattesh Shanbhag, Chandan Aladahalli, Gurunath Reddy M
OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation
Gonca Yilmaz, Songyou Peng, Marc Pollefeys, Francis Engelmann, Hermann Blum
Knowledge-grounded Adaptation Strategy for Vision-language Models: Building Unique Case-set for Screening Mammograms for Residents Training
Aisha Urooj Khan, John Garrett, Tyler Bradshaw, Lonie Salkowski, Jiwoong Jason Jeong, Amara Tariq, Imon Banerjee