Vision Language Model
Vision-language models (VLMs) integrate visual and textual information, bridging the gap between computer vision and natural language processing to handle tasks that require joint reasoning over images and text. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead by pruning redundant visual tokens. These advances matter because they enable VLMs to be deployed in diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
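To make the prompt-tuning idea above concrete, the sketch below shows the common recipe of learning a handful of soft context vectors that are prepended to class-name embeddings while the pretrained encoders stay frozen. It is a minimal illustration, not the method of any paper listed here: the names (FrozenTextEncoder, PromptLearner, EMBED_DIM, N_CTX, CLASS_NAMES), the toy text tower, and the random image features are all assumptions standing in for a real CLIP-style backbone and tokenizer.

```python
# Minimal sketch of soft prompt tuning for a CLIP-style VLM.
# All module names, dimensions, and data below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512       # shared image/text embedding size (assumed)
N_CTX = 4             # number of learnable context tokens (assumed)
CLASS_NAMES = ["cat", "dog", "car"]  # placeholder label set


class FrozenTextEncoder(nn.Module):
    """Stand-in for a pretrained, frozen text tower (e.g., CLIP's)."""
    def __init__(self, vocab_size=1000, dim=EMBED_DIM):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():
            p.requires_grad_(False)  # keep the backbone frozen

    def forward(self, token_embeddings):
        # Mean-pool the sequence into one text feature per class prompt.
        return self.encoder(token_embeddings).mean(dim=1)


class PromptLearner(nn.Module):
    """Learnable context vectors prepended to each class-name embedding."""
    def __init__(self, text_encoder, class_token_ids):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(N_CTX, EMBED_DIM) * 0.02)
        with torch.no_grad():
            # Pre-compute frozen embeddings of the class-name tokens.
            self.register_buffer(
                "class_embeds", text_encoder.token_embed(class_token_ids)
            )  # (num_classes, name_len, dim)

    def forward(self):
        n_cls = self.class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.class_embeds], dim=1)


def classify(image_feats, text_feats, temperature=0.07):
    """Cosine-similarity logits between image and per-class text features."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.t() / temperature


if __name__ == "__main__":
    text_encoder = FrozenTextEncoder()
    # Fake token ids for each class name (a real pipeline uses a tokenizer).
    class_token_ids = torch.randint(0, 1000, (len(CLASS_NAMES), 3))
    prompt_learner = PromptLearner(text_encoder, class_token_ids)

    optimizer = torch.optim.Adam(prompt_learner.parameters(), lr=1e-3)
    image_feats = torch.randn(8, EMBED_DIM)           # stand-in image features
    labels = torch.randint(0, len(CLASS_NAMES), (8,))

    for _ in range(5):  # a few illustrative steps; only the context is updated
        text_feats = text_encoder(prompt_learner())
        loss = F.cross_entropy(classify(image_feats, text_feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The appeal of this setup is that only the few context vectors receive gradients, so adapting a large frozen VLM to a new task costs a tiny fraction of full fine-tuning.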
Papers
MBQ: Modality-Balanced Quantization for Large Vision-Language Models
Shiyao Li, Yingchun Hu, Xuefei Ning, Xihui Liu, Ke Hong, Xiaotao Jia, Xiuhong Li, Yaqi Yan, Pei Ran, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation
Chengyang Ye, Yunzhi Zhuge, Pingping Zhang
TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
Pooyan Rahmanzadehgrevi, Hung Huy Nguyen, Rosanne Liu, Long Mai, Anh Totti Nguyen
Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model
Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao
Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model
Yushu Li, Yongyi Su, Adam Goodge, Kui Jia, Xun Xu
Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight
Xi Ding, Lei Wang
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal
ChatGarment: Garment Estimation, Generation and Editing via Large Language Models
Siyuan Bian, Chenghao Xu, Yuliang Xiu, Artur Grigorev, Zhen Liu, Cewu Lu, Michael J. Black, Yao Feng
Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection
Yitong Chen, Wenhao Yao, Lingchen Meng, Sihong Wu, Zuxuan Wu, Yu-Gang Jiang
Reasoning to Attend: Try to Understand How <SEG> Token Works
Rui Qian, Xin Yin, Dejing Dou
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
Zaitang Li, Pin-Yu Chen, Tsung-Yi Ho
On the Feasibility of Vision-Language Models for Time-Series Classification
Vinay Prithyani, Mohsin Mohammed, Richa Gadgil, Ricardo Buitrago, Vinija Jain, Aman Chadha
GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning
Teja Krishna Cherukuri, Nagur Shareef Shaik, Jyostna Devi Bodapati, Dong Hye Ye
HyperCLIP: Adapting Vision-Language models with Hypernetworks
Victor Akinwande, Mohammad Sadegh Norouzzadeh, Devin Willmott, Anna Bair, Madan Ravi Ganesh, J. Zico Kolter
REO-VLM: Transforming VLM to Meet Regression Challenges in Earth Observation
Xizhe Xue, Guoting Wei, Hao Chen, Haokui Zhang, Feng Lin, Chunhua Shen, Xiao Xiang Zhu
Beyond End-to-End VLMs: Leveraging Intermediate Text Representations for Superior Flowchart Understanding
Junyi Ye, Ankan Dash, Wenpeng Yin, Guiling Wang
Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities
Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang