Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for a specific task, and sparse token optimization, which reduces computational overhead by pruning or compressing visual tokens. These advances matter because they enable VLMs to be applied in diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
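To make "prompt tuning" concrete, below is a minimal, illustrative sketch in the style of CoOp-like methods, assuming a frozen CLIP-like text encoder that consumes token embeddings. The names `LearnablePrompt`, `n_ctx`, `embed_dim`, and the random placeholder class-name embeddings are assumptions for illustration, not the API of any specific library or of the papers listed here.

```python
# Hedged sketch: CoOp-style prompt tuning for a frozen CLIP-like VLM.
# Only the learnable context vectors are trained; the pretrained image and
# text encoders stay frozen. Placeholder names/shapes are illustrative.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int = 16, embed_dim: int = 512, n_classes: int = 10):
        super().__init__()
        # Learnable "context" vectors shared across classes; these replace
        # hand-written prompt words such as "a photo of a".
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # In a real pipeline these would be the frozen class-name token
        # embeddings from the VLM's tokenizer; random placeholders here.
        self.register_buffer("class_tokens", torch.randn(n_classes, 1, embed_dim))

    def forward(self) -> torch.Tensor:
        # Prepend the shared learnable context to each class-name embedding,
        # yielding one prompt sequence per class: (n_classes, n_ctx + 1, embed_dim).
        ctx = self.ctx.unsqueeze(0).expand(self.class_tokens.size(0), -1, -1)
        return torch.cat([ctx, self.class_tokens], dim=1)

# Training outline: gradients flow only into the prompt parameters, so the
# adaptation cost is a few thousand parameters rather than the whole VLM.
prompt = LearnablePrompt()
optimizer = torch.optim.AdamW(prompt.parameters(), lr=2e-3)
prompt_embeddings = prompt()  # fed to the frozen text encoder in practice
```

The design choice this sketch highlights is that task adaptation happens entirely in the prompt space, which is why prompt tuning is attractive for efficiency: the large pretrained encoders are never updated.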
Papers
A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models
Yuhan Liang, Yijun Li, Yumeng Niu, Qianhe Shen, Hangyu Liu
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection
Andrea Appiani, Cigdem Beyan
Zero-shot Action Localization via the Confidence of Large Vision-Language Models
Josiah Aklilu, Xiaohan Wang, Serena Yeung-Levy
E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model
Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Weifu Lv, S. Kevin Zhou
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
Zifeng Zhu, Mengzhao Jia, Zhihan Zhang, Lang Li, Meng Jiang
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, Mahyar Najibi
VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding
Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin
Improving Multi-modal Large Language Model through Boosting Vision Capabilities
Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li
H2OVL-Mississippi Vision Language Models Technical Report
Shaikat Galib, Shanshan Wang, Guanshuo Xu, Pascal Pfeiffer, Ryan Chesler, Mark Landry, Sri Satish Ambati
GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models
Aditya Sharma, Aman Dalmia, Mehran Kazemi, Amal Zouaq, Christopher J. Pal
LocateBench: Evaluating the Locating Ability of Vision Language Models
Ting-Rui Chiang, Joshua Robinson, Xinyan Velocity Yu, Dani Yogatama
Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead
Kuleen Sasse, Shan Chen, Jackson Pond, Danielle Bitterman, John Osborne
Trust but Verify: Programmatic VLM Evaluation in the Wild
Viraj Prabhu, Senthil Purushwalkam, An Yan, Caiming Xiong, Ran Xu
Sensitivity of Generative VLMs to Semantically and Lexically Altered Prompts
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad
BlabberSeg: Real-Time Embedded Open-Vocabulary Aerial Segmentation
Haechan Mark Bong, Ricardo de Azambuja, Giovanni Beltrame
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie