Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform tasks that require joint reasoning over images and language, bridging computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which prunes redundant visual tokens to reduce computational overhead. These advances matter because they let VLMs be applied to diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing open challenges such as hallucination and model miscalibration.
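To make the prompt-tuning idea concrete, below is a minimal PyTorch sketch in the spirit of CoOp-style soft prompts. It is an illustration, not the method of any paper listed here: the `PromptLearner` class, the mean-pool stand-in for a frozen text encoder, and all shapes and hyperparameters are hypothetical. The key point is that only the learnable context vectors receive gradients, while the (here simulated) backbone features stay frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """Learnable 'soft prompt' context vectors; everything else stays frozen."""
    def __init__(self, class_embeds, n_ctx=4):
        super().__init__()
        dim = class_embeds.shape[-1]
        # Small random init for the context tokens shared across classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen class-name embeddings (buffer, so no gradient).
        self.register_buffer("class_embeds", class_embeds)  # (n_cls, dim)

    def forward(self):
        n_cls = self.class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)   # (n_cls, n_ctx, dim)
        cls = self.class_embeds.unsqueeze(1)                # (n_cls, 1, dim)
        # Mean pooling is a stand-in for a frozen transformer text encoder.
        return torch.cat([ctx, cls], dim=1).mean(dim=1)     # (n_cls, dim)

torch.manual_seed(0)
class_embeds = torch.randn(3, 512)                    # placeholder class embeddings
learner = PromptLearner(class_embeds)
optimizer = torch.optim.Adam([learner.ctx], lr=1e-2)  # tune only the prompt

for step in range(100):
    # Placeholder frozen image features and labels for an 8-image batch.
    image_feats = F.normalize(torch.randn(8, 512), dim=-1)
    labels = torch.randint(0, 3, (8,))
    text_feats = F.normalize(learner(), dim=-1)
    logits = image_feats @ text_feats.t() / 0.07      # temperature-scaled cosine logits
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a real setup, `class_embeds` would come from the VLM's text encoder applied to class names and `image_feats` from its frozen image encoder; at test time the same temperature-scaled cosine logits are used for classification, so prompt tuning adapts the model to a task without touching the backbone weights.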
Papers
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox
Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data
Jiahan Zhang, Qi Wei, Feng Liu, Lei Feng
From Pixels to Prose: A Large Dataset of Dense Image Captions
Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein
Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding
Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar
DevBench: A multimodal developmental benchmark for language learning
Alvin Wei Ming Tan, Sunny Yu, Bria Long, Wanjing Anya Ma, Tonya Murray, Rebecca D. Silverman, Jason D. Yeatman, Michael C. Frank
CarLLaVA: Vision language models for camera-only closed-loop driving
Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski
RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model
Hantao Zhou, Tianying Ji, Lukas Sommerhalder, Michael Goerner, Norman Hendrich, Jianwei Zhang, Fuchun Sun, Huazhe Xu
Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning
Xiaowen Sun, Xufeng Zhao, Jae Hee Lee, Wenhao Lu, Matthias Kerzel, Stefan Wermter
Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment
Fei Zhou, Zhicong Huang, Tianhao Gu, Guoping Qiu
Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting
Ce Hao, Kelvin Lin, Siyuan Luo, Harold Soh
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Karthik Nandakumar, Ivan Laptev
Generative AI-based Prompt Evolution Engineering Design Optimization With Vision-Language Model
Melvin Wong, Thiago Rios, Stefan Menzel, Yew Soon Ong
How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models
Tarun Khajuria, Braian Olmiro Dias, Jaan Aru
Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency
Maor Dikter, Tsachi Blau, Chaim Baskin
Advancing High Resolution Vision-Language Models in Biomedicine
Zekai Chen, Arda Pekis, Kevin Brown
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuehne, Trevor Darrell, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky
A3VLM: Actionable Articulation-Aware Vision Language Model
Siyuan Huang, Haonan Chang, Yuhan Liu, Yimeng Zhu, Hao Dong, Peng Gao, Abdeslam Boularias, Hongsheng Li
Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph
Sergey Linok, Tatiana Zemskova, Svetlana Ladanova, Roman Titkov, Dmitry Yudin, Maxim Monastyrny, Aleksei Valenkov