Vision-Language Models
Vision-language models (VLMs) integrate visual and textual information to perform complex tasks, bridging the gap between computer vision and natural language processing. Current research focuses on improving VLM efficiency and robustness through techniques such as prompt tuning, which optimizes textual or visual prompts for specific tasks, and sparse token optimization, which reduces computational overhead. These advances matter because they let VLMs be applied to diverse real-world settings, including robotics, autonomous driving, medical image analysis, and fake news detection, while addressing challenges such as hallucination and model miscalibration.
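To make the prompt-tuning idea above concrete, here is a minimal CoOp-style sketch: a small set of learnable context vectors is prepended to frozen class-name embeddings, and only those vectors would be optimized. All names, shapes, and the stand-in encoders below are illustrative assumptions, not the actual CLIP API.

```python
# Sketch of prompt tuning (CoOp-style): learnable context tokens + frozen
# class-name embeddings produce per-class text features, which are compared
# to an image feature via cosine similarity. Encoders here are stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d = 512            # embedding dimension (assumed)
n_ctx = 4          # number of learnable context tokens (assumed)
classes = ["cat", "dog", "car"]

# The context vectors are the ONLY parameters prompt tuning would train.
ctx = rng.normal(size=(n_ctx, d)) * 0.02

def embed_class_name(name: str) -> np.ndarray:
    # Stand-in for a frozen token embedder (one token per class name).
    return rng.normal(size=(1, d))

def text_feature(name: str) -> np.ndarray:
    # Prepend learnable context to the class-name embedding, then pool.
    tokens = np.concatenate([ctx, embed_class_name(name)], axis=0)
    feat = tokens.mean(axis=0)          # stand-in for the frozen text encoder
    return feat / np.linalg.norm(feat)

def classify(image_feat: np.ndarray) -> np.ndarray:
    # Cosine-similarity logits between the image and each class prompt.
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = np.stack([text_feature(c) for c in classes])
    return text_feats @ image_feat

logits = classify(rng.normal(size=d))   # one logit per class
print(logits.shape)
```

In a real system, `ctx` would be updated by backpropagating a classification loss through the frozen text encoder, while all encoder weights stay fixed.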
1067 papers
April 2, 2025
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs
Mothilal Asokan, Kebin Wu, Fatima Albreiki
Technology Innovation Institute (TII)

Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
Shreyank N Gowda, Boyan Gao, Xiao Gu, Xiaobo Jin
University of Nottingham ● University of Oxford ● Xi’an Jiaotong-Liverpool University

Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images
Nusrat Munia, Abdullah-Al-Zubaer Imran
University of Kentucky

BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing
Yunqi Gu, Ian Huang, Jihyeon Je, Guandao Yang, Leonidas Guibas
Stanford University

Reasoning LLMs for User-Aware Multimodal Conversational Agents
Hamed Rahimi, Jeanne Cattoni, Meriem Beghili, Mouad Abrini, Mahdi Khoramshahi, Maribel Pino, Mohamed Chetouani
Sorbonne Université ● Université Paris Cité ● Assistance Publique – Hôpitaux de Paris (AP-HP)

Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models
Zhaochen Wang, Yujun Cai, Zi Huang, Bryan Hooi, Yiwei Wang, Ming-Hsuan Yang
The University of Queensland ● National University of Singapore ● University of California at Merced

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang, Yushen Zuo, Yuanjun Chai, Zhendong Liu, Yichen Fu, Yichun Feng, Kin-man Lam
University of Science and Technology of China ● The Hong Kong Polytechnic University ● University of Washington ● Nanjing University ● Stanford...+2
April 1, 2025
TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images
Kaiyuan Hou, Minghui Zhao, Lilin Xu, Yuang Fan, Xiaofan Jiang

AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
Kristen M. Edwards, Farnaz Tehranchi, Scarlett R. Miller, Faez Ahmed
Massachusetts Institute of Technology ● The Pennsylvania State University
March 31, 2025
Self-Evolving Visual Concept Library using Vision-Language Critics
Atharva Sehgal, Patrick Yuan, Ziniu Hu, Yisong Yue, Jennifer J. Sun, Swarat Chaudhuri
University of Texas at Austin ● Cornell University ● California Institute of Technology

Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang
University of California ● ByteDance ● The University of Queensland ● University of Southern California ● University at Buffalo

KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Yoonshik Kim, Jaeyoon Jung
MAUM AI Inc.
March 30, 2025
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Maximilian Augustin, Yannic Neuhaus, Matthias Hein
Tübingen AI Center – University of Tübingen

BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation
Rafi Ibn Sultan, Hui Zhu, Chengyin Li, Dongxiao Zhu
Wayne State University

Re-Aligning Language to Visual Objects with an Agentic Workflow
Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song
Nankai University ● SenseTime Research

Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies in Vision-Language Models
Sid Bharthulwar, John Rho, Katrina Brown
Harvard College

COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
Fanding Huang, Jingyan Jiang, Qinting Jiang, Hebei Li, Faisal Nadeem Khan, Zhi Wang
Tsinghua University ● Shenzhen Technology University ● University of Science and Technology of China

From Panels to Prose: Generating Literary Narratives from Comics
Ragav Sachdeva, Andrew Zisserman
University of Oxford

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
Zhenyang Liu, Yikai Wang, Sixiao Zheng, Tongying Pan, Longfei Liang, Yanwei Fu, Xiangyang Xue
Fudan University ● Shanghai Innovation Institute ● Nanyang Technological University ● Ltd