Vision Language Model
Vision-language models (VLMs) integrate visual and textual information to perform complex multimodal tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes learnable textual or visual prompts for specific tasks, and visual token pruning, which discards redundant image tokens to reduce computational overhead. These advances matter because they allow VLMs to be deployed in diverse real-world applications, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
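To make the prompt-tuning idea mentioned above concrete, the sketch below shows a minimal CoOp-style setup in PyTorch: a small set of learnable context vectors is prepended to frozen class-name embeddings and optimized with a classification loss while both encoder towers stay frozen. This is an illustrative sketch, not any listed paper's method; the FrozenEncoder modules, embedding sizes, and random batch are hypothetical placeholders standing in for a real pretrained VLM such as CLIP.

```python
# Minimal sketch of CoOp-style prompt tuning: only a few continuous "context"
# vectors are trained; the (placeholder) image and text encoders stay frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, N_CTX, N_CLASSES = 512, 8, 10  # hypothetical sizes


class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained, frozen encoder tower (image or text)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad = False  # frozen: no weights updated here

    def forward(self, x):
        if x.dim() == 3:          # text input: (n_classes, n_tokens, dim)
            x = x.mean(dim=1)     # crude pooling over prompt tokens
        return F.normalize(self.proj(x), dim=-1)


class PromptLearner(nn.Module):
    """Learnable context vectors shared across all classes (CoOp-style)."""

    def __init__(self, class_name_embeds):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(N_CTX, EMBED_DIM) * 0.02)
        # Frozen token embeddings of the class names, shape (n_classes, dim).
        self.register_buffer("class_embeds", class_name_embeds)

    def forward(self):
        ctx = self.ctx.unsqueeze(0).expand(N_CLASSES, -1, -1)  # (C, n_ctx, D)
        names = self.class_embeds.unsqueeze(1)                 # (C, 1, D)
        return torch.cat([ctx, names], dim=1)                  # (C, n_ctx+1, D)


image_encoder = FrozenEncoder(3 * 224 * 224, EMBED_DIM)
text_encoder = FrozenEncoder(EMBED_DIM, EMBED_DIM)
prompts = PromptLearner(torch.randn(N_CLASSES, EMBED_DIM))
optimizer = torch.optim.AdamW(prompts.parameters(), lr=2e-3)  # prompts only

# One toy training step on random tensors standing in for (image, label) batches.
images = torch.randn(4, 3 * 224 * 224)
labels = torch.randint(0, N_CLASSES, (4,))

image_feats = image_encoder(images)             # (B, D)
text_feats = text_encoder(prompts())            # (C, D)
logits = 100.0 * image_feats @ text_feats.t()   # scaled cosine similarities
loss = F.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```

The design point the sketch illustrates is that gradients flow only into the handful of context vectors, so adapting the VLM to a new task updates a few thousand parameters rather than the full model.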
Papers
TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models
Ammar N. Abbas, Csaba Beleznai
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, Salman Khan
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space
Yonghao He, Hu Su, Haiyong Yu, Cong Yang, Wei Sui, Cong Wang, Song Liu
HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model
Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue
GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
Saumya Saxena, Blake Buchanan, Chris Paxton, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision
Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P. Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M. Wolff, Xin Huang
Surrealistic-like Image Generation with Vision-Language Models
Elif Ayten, Shuai Wang, Hjalmar Snoep
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
Ido Cohen, Daniela Gottesman, Mor Geva, Raja Giryes
Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition
Ethan Baron, Idan Tankel, Peter Tu, Guy Ben-Yosef
Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation
Changsun Lee, Sangjoon Park, Cheong-Il Shin, Woo Hee Choi, Hyun Jeong Park, Jeong Eun Lee, Jong Chul Ye
PLPP: Prompt Learning with Perplexity Is Self-Distillation for Vision-Language Models
Biao Liu, Wenyi Fang, Xiaoyu Wu, Yang Zheng, Zheng Hu, Bo Yuan
Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
Qitong Wang, Tang Li, Kien X. Nguyen, Xi Peng
FastVLM: Efficient Vision Encoding for Vision Language Models
Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
Chen Bao, Jiarui Xu, Xiaolong Wang, Abhinav Gupta, Homanga Bharadhwaj
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Mark Endo, Xiaohan Wang, Serena Yeung-Levy
Improving Fine-grained Visual Understanding in VLMs through Text-Only Training
Dasol Choi, Guijin Son, Soo Yong Kim, Gio Paik, Seunghyeok Hong
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
Zihui Cheng, Qiguang Chen, Jin Zhang, Hao Fei, Xiaocheng Feng, Wanxiang Che, Min Li, Libo Qin
SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models
Wenyu Zhang, Wei En Ng, Lixin Ma, Yuwen Wang, Jungqi Zhao, Boyang Li, Lu Wang
DuSSS: Dual Semantic Similarity-Supervised Vision-Language Model for Semi-Supervised Medical Image Segmentation
Qingtao Pan, Wenhao Qiao, Jingjiao Lou, Bing Ji, Shuo Li