Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex multimodal tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes a small set of textual or visual prompts for a specific task while leaving the pretrained backbone frozen, and sparse token optimization, which reduces the number of tokens processed to cut computational overhead. These advances are significant because they enable VLMs to be deployed in diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing persistent challenges such as hallucination and model miscalibration.
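To make the prompt-tuning idea concrete, below is a minimal CoOp-style sketch in PyTorch: the only trainable parameters are a few learnable context vectors prepended to each class-name embedding, while the pretrained encoders stay frozen. Everything here (FrozenEncoder, class_name_emb, the dimensions) is a toy stand-in chosen to keep the example self-contained and runnable, not the API of any paper listed below; a real system would use a pretrained VLM such as CLIP for both encoders.

```python
# Minimal CoOp-style soft-prompt-tuning sketch in pure PyTorch.
# All names and sizes (FrozenEncoder, class_name_emb, DIM, ...) are toy
# stand-ins for a real pretrained VLM such as CLIP.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_CTX, N_CLASSES, BATCH = 64, 4, 5, 8

class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained, frozen text tower (e.g., CLIP's)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad_(False)          # the backbone is never updated

    def forward(self, x):                    # x: (batch, seq_len, dim)
        return self.proj(x.mean(dim=1))      # pooled embedding: (batch, dim)

text_encoder = FrozenEncoder(DIM)

# Fixed per-class "class name" embeddings; in a real VLM these come from the
# tokenizer and the model's token-embedding table.
class_name_emb = torch.randn(N_CLASSES, 1, DIM)

# The only trainable parameters: shared context vectors prepended to every
# class-name embedding -- the essence of soft prompt tuning.
ctx = nn.Parameter(0.02 * torch.randn(N_CTX, DIM))
optimizer = torch.optim.AdamW([ctx], lr=1e-3)

def class_text_features():
    # Build "ctx + class name" prompts: (N_CLASSES, N_CTX + 1, DIM)
    prompts = torch.cat([ctx.expand(N_CLASSES, -1, -1), class_name_emb], dim=1)
    return F.normalize(text_encoder(prompts), dim=-1)

# Stand-ins for image-encoder outputs and labels of a few-shot training batch.
image_feats = F.normalize(torch.randn(BATCH, DIM), dim=-1)
labels = torch.randint(0, N_CLASSES, (BATCH,))

for step in range(10):
    logits = 100.0 * image_feats @ class_text_features().t()  # scaled cosine sims
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()   # gradients reach only ctx; both encoders stay frozen
    optimizer.step()
```

Test-time variants of prompt tuning keep the same machinery but adapt the context vectors on unlabeled test inputs, typically by minimizing the entropy of the predicted class distribution instead of a supervised cross-entropy loss.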
Papers
DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving
Yongjie Fu, Anmol Jain, Xuan Di, Xu Chen, Zhaobin Mo
Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning
Zhengqing Gao, Xiang Ao, Xu-Yao Zhang, Cheng-Lin Liu
Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation
Vivek Myers, Bill Chunyuan Zheng, Oier Mees, Sergey Levine, Kuan Fang
LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
Jingyi Wang, Jianzhong Ju, Jian Luan, Zhidong Deng
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
M. Maruf, Arka Daw, Kazi Sajeed Mehrab, Harish Babu Manogaran, Abhilash Neog, Medha Sawhney, Mridul Khurana, James P. Balhoff, Yasin Bakis, Bahadir Altintas, Matthew J. Thompson, Elizabeth G. Campolongo, Josef C. Uyeda, Hilmar Lapp, Henry L. Bart, Paula M. Mabee, Yu Su, Wei-Lun Chao, Charles Stewart, Tanya Berger-Wolf, Wasila Dahdul, Anuj Karpatne
Visual Prompt Engineering for Medical Vision Language Models in Radiology
Stefan Denner, Markus Bujotzek, Dimitrios Bounias, David Zimmerer, Raphael Stock, Paul F. Jäger, Klaus Maier-Hein
Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail
Bianca Lamm, Janis Keuper
Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis
Sakhinana Sagar Srinivas, Geethan Sannidhi, Venkataramana Runkana
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
Aishik Nagar, Shantanu Jaiswal, Cheston Tan
CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIP
Zhenchen Tang, Zichuan Wang, Bo Peng, Jing Dong
VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities
Shusaku Egami, Takahiro Ugai, Swe Nwe Nwe Htun, Ken Fukuda
HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling
Yubin Wang, Xinyang Jiang, De Cheng, Wenli Sun, Dongsheng Li, Cairong Zhao
MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation
Yuanbing Zhu, Bingke Zhu, Yingying Chen, Yunfang Niu, Ming Tang, Jinqiao Wang
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
Social perception of faces in a vision-language model
Carina I. Hausladen, Manuel Knott, Colin F. Camerer, Pietro Perona
More Pictures Say More: Visual Intersection Network for Open Set Object Detection
Bingcheng Dong, Yuning Ding, Jinrong Zhang, Sifan Zhang, Shenglan Liu
Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models
Shuai Fu, Xiequn Wang, Qiushi Huang, Yu Zhang