Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform tasks that require joint reasoning over images and text, bridging computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes learnable textual or visual prompts for specific downstream tasks, and sparse token optimization, which prunes redundant visual tokens to reduce computational overhead. These advances matter because they let VLMs serve diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while mitigating challenges such as hallucination and model miscalibration.
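To make the prompt-tuning idea concrete, below is a minimal, self-contained sketch of CoOp-style soft-prompt tuning in PyTorch. The toy `image_encoder`, `text_encoder`, and `class_name_emb` here are hypothetical stand-ins for a pretrained VLM's frozen components, not any listed paper's actual code; the essential pattern is that only the small set of learnable context vectors `ctx` receives gradient updates.

```python
# Minimal sketch of soft-prompt tuning for a CLIP-like VLM (toy encoders are
# assumptions for illustration). The backbone stays frozen; only the context
# vectors prepended to each class prompt are trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64        # shared embedding width (toy value)
N_CTX = 4     # number of learnable context tokens
CLASSES = 10  # number of target classes

# Frozen stand-ins for a pretrained VLM's image and text encoders.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D))
text_encoder = nn.GRU(D, D, batch_first=True)    # consumes token embeddings
class_name_emb = torch.randn(CLASSES, 1, D)      # fixed class-name token embeddings
for p in list(image_encoder.parameters()) + list(text_encoder.parameters()):
    p.requires_grad_(False)

# The only trainable parameters: context vectors shared across all class prompts.
ctx = nn.Parameter(torch.randn(N_CTX, D) * 0.02)
opt = torch.optim.Adam([ctx], lr=2e-3)

def class_text_features():
    # Build prompts "[ctx_1 ... ctx_N][class name]" and encode them.
    prompts = torch.cat([ctx.expand(CLASSES, N_CTX, D), class_name_emb], dim=1)
    _, h = text_encoder(prompts)              # final hidden state per prompt
    return F.normalize(h.squeeze(0), dim=-1)  # (CLASSES, D)

# One training step on a dummy labeled batch.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, CLASSES, (8,))
img_feats = F.normalize(image_encoder(images), dim=-1)
logits = 100.0 * img_feats @ class_text_features().t()  # scaled cosine similarities
loss = F.cross_entropy(logits, labels)
opt.zero_grad()
loss.backward()   # gradients flow through the frozen encoders into ctx only
opt.step()
```

In practice the same loop runs against a real frozen CLIP-like backbone; because only `ctx` is optimized, adapting the model to a new task is cheap in both memory and compute, which is what makes prompt tuning attractive for the efficiency-focused work surveyed here.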
Papers
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
M. Maruf, Arka Daw, Kazi Sajeed Mehrab, Harish Babu Manogaran, Abhilash Neog, Medha Sawhney, Mridul Khurana, James P. Balhoff, Yasin Bakis, Bahadir Altintas, Matthew J. Thompson, Elizabeth G. Campolongo, Josef C. Uyeda, Hilmar Lapp, Henry L. Bart, Paula M. Mabee, Yu Su, Wei-Lun Chao, Charles Stewart, Tanya Berger-Wolf, Wasila Dahdul, Anuj Karpatne
Visual Prompt Engineering for Medical Vision Language Models in Radiology
Stefan Denner, Markus Bujotzek, Dimitrios Bounias, David Zimmerer, Raphael Stock, Paul F. Jäger, Klaus Maier-Hein
Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail
Bianca Lamm, Janis Keuper
Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis
Sakhinana Sagar Srinivas, Geethan Sannidhi, Venkataramana Runkana
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
Aishik Nagar, Shantanu Jaiswal, Cheston Tan
CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIP
Zhenchen Tang, Zichuan Wang, Bo Peng, Jing Dong
VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities
Shusaku Egami, Takahiro Ugai, Swe Nwe Nwe Htun, Ken Fukuda
HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling
Yubin Wang, Xinyang Jiang, De Cheng, Wenli Sun, Dongsheng Li, Cairong Zhao
MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation
Yuanbing Zhu, Bingke Zhu, Zhen Chen, Huan Xu, Ming Tang, Jinqiao Wang
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
Social perception of faces in a vision-language model
Carina I. Hausladen, Manuel Knott, Colin F. Camerer, Pietro Perona
More Pictures Say More: Visual Intersection Network for Open Set Object Detection
Bingcheng Dong, Yuning Ding, Jinrong Zhang, Sifan Zhang, Shenglan Liu
Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models
Shuai Fu, Xiequn Wang, Qiushi Huang, Yu Zhang
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon
FIDAVL: Fake Image Detection and Attribution using Vision-Language Model
Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Enabling Small Models for Zero-Shot Classification through Model Label Learning
Jia Zhang, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using Omnidirectional Camera and Multiple Vision-Language Models
Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Naoto Tsukamoto, Kei Okada, Masayuki Inaba