Vision Language Model
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead. These advances matter because they allow VLMs to be deployed in diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
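As a rough illustration of the prompt-tuning idea mentioned above, the sketch below trains a small set of learnable context vectors for a frozen CLIP-style encoder pair, in the spirit of soft-prompt methods. The encoder stubs, embedding size, class count, and tensor shapes are placeholder assumptions for the sketch, not any particular paper's method.

```python
# Minimal sketch of soft prompt tuning for a CLIP-style vision-language model.
# The encoders below are simple stand-ins; a real setup would reuse a frozen,
# pretrained image/text encoder pair and train only the prompt vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512      # shared image/text embedding size (assumed)
CTX_LEN = 4          # number of learnable context tokens per prompt
NUM_CLASSES = 10     # downstream classes (assumed)

class FrozenTextEncoder(nn.Module):
    """Stand-in for a pretrained text encoder: pools token embeddings into one vector."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, EMBED_DIM)
        for p in self.parameters():
            p.requires_grad_(False)  # encoder stays frozen during prompt tuning

    def forward(self, token_embeds):                 # (classes, tokens, dim)
        return self.proj(token_embeds.mean(dim=1))   # pooled per-class text features

class PromptTuner(nn.Module):
    """Learns context vectors that are prepended to each class-name embedding."""
    def __init__(self, class_name_embeds):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(CTX_LEN, EMBED_DIM) * 0.02)  # trainable prompt
        self.register_buffer("class_embeds", class_name_embeds)          # fixed class names
        self.text_encoder = FrozenTextEncoder()

    def forward(self, image_features):               # (batch, dim), from a frozen vision encoder
        ctx = self.ctx.unsqueeze(0).expand(NUM_CLASSES, -1, -1)
        prompts = torch.cat([ctx, self.class_embeds], dim=1)   # (classes, ctx+name tokens, dim)
        text_features = self.text_encoder(prompts)
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        return image_features @ text_features.t()              # cosine-similarity logits

# Usage: only self.ctx receives gradients; encoders and class-name embeddings stay fixed.
class_names = torch.randn(NUM_CLASSES, 3, EMBED_DIM)   # placeholder tokenized class names
model = PromptTuner(class_names)
images = torch.randn(8, EMBED_DIM)                     # placeholder image features
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = F.cross_entropy(model(images), labels)
loss.backward()
```

Because everything except the context vectors is frozen, the number of trained parameters stays tiny, which is the efficiency argument behind prompt tuning for task-specific adaptation.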
Papers
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrieval
Yuanxin Zhao, Mi Zhang, Bingnan Yang, Zhan Zhang, Jiaju Kang, Jianya Gong
GAgent: An Adaptive Rigid-Soft Gripping Agent with Vision Language Models for Complex Lighting Environments
Zhuowei Li, Miao Zhang, Xiaotian Lin, Meng Yin, Shuai Lu, Xueqian Wang
MeDSLIP: Medical Dual-Stream Language-Image Pre-training for Fine-grained Alignment
Wenrui Fan, Mohammod Naimul Islam Suvon, Shuo Zhou, Xianyuan Liu, Samer Alabed, Venet Osmani, Andrew Swift, Chen Chen, Haiping Lu
Leveraging vision-language models for fair facial attribute classification
Miao Zhang, Rumi Chunara
Reconfigurable Robot Identification from Motion Data
Yuhang Hu, Yunzhe Wang, Ruibo Liu, Zhou Shen, Hod Lipson
EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
Rocktim Jyoti Das, Simeon Emilov Hristov, Haonan Li, Dimitar Iliyanov Dimitrov, Ivan Koychev, Preslav Nakov
Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models
Tian Meng, Yang Tao, Ruilin Lyu, Wuliang Yin
CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning
Yukun Li, Guansong Pang, Wei Suo, Chenchen Jing, Yuling Xi, Lingqiao Liu, Hao Chen, Guoqiang Liang, Peng Wang
An Image Is Worth 1000 Lies: Adversarial Transferability across Prompts on Vision-Language Models
Haochen Luo, Jindong Gu, Fengyuan Liu, Philip Torr
PosSAM: Panoptic Open-vocabulary Segment Anything
Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Haiwen Huang, Songyou Peng, Dan Zhang, Andreas Geiger
Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen, Kai-Po Chang, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang
Are Vision Language Models Texture or Shape Biased and Can We Steer Them?
Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Hugo Laurençon, Léo Tronchon, Victor Sanh
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan
Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery
Xavier Bou, Gabriele Facciolo, Rafael Grompone von Gioi, Jean-Michel Morel, Thibaud Ehret