Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for a specific task, and sparse token optimization, which reduces computational overhead. These advances are significant because they allow VLMs to be applied to diverse real-world domains, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
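To make the prompt-tuning idea concrete, here is a minimal CoOp-style sketch: a small set of learnable context vectors is prepended to frozen class-name embeddings, and only those vectors are optimized while the pretrained VLM stays fixed. Everything below (the PromptLearner class, the random stand-ins for the frozen text and image encoders, and the hyperparameters) is an illustrative assumption, not the implementation from any of the papers listed here.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Illustrative stand-in for CoOp-style prompt tuning (not from any listed paper)."""

    def __init__(self, n_ctx: int, dim: int, class_embeddings: torch.Tensor):
        super().__init__()
        # Learnable context ("prompt") vectors, shared across all classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen class-name token embeddings: (n_classes, n_name_tokens, dim).
        # Registered as a buffer so it receives no gradient updates.
        self.register_buffer("cls_emb", class_embeddings)

    def forward(self) -> torch.Tensor:
        n_cls = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Prepend the shared context vectors to each class-name sequence.
        return torch.cat([ctx, self.cls_emb], dim=1)

# Toy usage with random tensors standing in for a frozen VLM's embeddings.
dim, n_classes, n_name_tokens = 512, 10, 4
class_emb = torch.randn(n_classes, n_name_tokens, dim)          # frozen
learner = PromptLearner(n_ctx=16, dim=dim, class_embeddings=class_emb)
optimizer = torch.optim.Adam(learner.parameters(), lr=2e-3)

prompts = learner()                    # (n_classes, 16 + 4, dim)
text_features = prompts.mean(dim=1)    # stand-in for the frozen text encoder
image_features = torch.randn(8, dim)   # stand-in for frozen image features
logits = image_features @ text_features.t()
loss = nn.functional.cross_entropy(logits, torch.randint(0, n_classes, (8,)))
loss.backward()                        # gradients flow only into learner.ctx
optimizer.step()
```

The key design point this sketch illustrates is parameter efficiency: only the 16 context vectors are trained, so a few-shot task can adapt the model without touching its pretrained weights.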
Papers
Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos
Harsh Mahesheka, Zhixian Xie, Zhaoran Wang, Wanxin Jin
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba
Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation
Kun Ding, Qiang Yu, Haojian Zhang, Gaofeng Meng, Shiming Xiang
RoRA-VLM: Robust Retrieval-Augmented Vision Language Models
Jingyuan Qi, Zhiyang Xu, Rulin Shao, Yang Chen, Jin Di, Yu Cheng, Qifan Wang, Lifu Huang
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model
Beichen Wang, Juexiao Zhang, Shuwen Dong, Irving Fang, Chen Feng
Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models
Reza Abbasi, Sernam Lim
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
Mengyuan Chen, Junyu Gao, Changsheng Xu
Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon
Q-VLM: Post-training Quantization for Large Vision-Language Models
Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu
Unsupervised Data Validation Methods for Efficient Model Training
Yurii Paniv
HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter
Yumiao Zhao, Bo Jiang, Xiao Wang, Qin Xu, Jin Tang
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks
Hoin Jung, Taeuk Jang, Xiaoqian Wang
How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?
Seongyun Lee, Geewook Kim, Jiyeon Kim, Hyunji Lee, Hoyeon Chang, Sue Hyun Park, Minjoon Seo
Towards Interpreting Visual Information Processing in Vision-Language Models
Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez
VHELM: A Holistic Evaluation of Vision Language Models
Tony Lee, Haoqin Tu, Chi Heem Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin Somerville Roberts, Michihiro Yasunaga, Huaxiu Yao, Cihang Xie, Percy Liang
Pixtral 12B
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova, Marie Pellat, Patrick Von Platen, Nikhil Raghuraman, Baptiste Rozière, Alexandre Sablayrolles, Lucile Saulnier, Romain Sauvestre, Wendy Shang, Roman Soletskyi, Lawrence Stewart, Pierre Stock, Joachim Studnia, Sandeep Subramanian, Sagar Vaze, Thomas Wang
Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback
Dennis Hein, Zhihong Chen, Sophie Ostmeier, Justin Xu, Maya Varma, Eduardo Pontes Reis, Arne Edward Michalson, Christian Bluethgen, Hyun Joo Shin, Curtis Langlotz, Akshay S Chaudhari
Compositional Entailment Learning for Hyperbolic Vision-Language Models
Avik Pal, Max van Spengler, Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, Pascal Mettes