Large Vision-Language Models
Large Vision-Language Models (LVLMs) combine computer vision and natural language processing so that a single model can interpret and reason over images and text together. Current research aims to improve their accuracy, efficiency, and robustness. Two focus areas stand out: reducing hallucinations (generated content that is inaccurate), and strengthening multi-level visual perception and reasoning, including quantitative spatial reasoning and mechanical understanding. These advances matter for applications such as medical image analysis, robotics, and autonomous driving, where they enable more reliable multimodal data processing.
Papers
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models
Chen Ju, Haicheng Wang, Zeqian Li, Xu Chen, Zhonghua Zhai, Weilin Huang, Shuai Xiao
TransMed: Large Language Models Enhance Vision Transformer for Biomedical Image Classification
Kaipeng Zheng, Weiran Huang, Lichao Sun
Domain Prompt Learning with Quaternion Networks
Qinglong Cao, Zhengqin Xu, Yuntian Chen, Chao Ma, Xiaokang Yang
InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites
Lei Wang, Jiabang He, Shenshen Li, Ning Liu, Ee-Peng Lim
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, Xu Yang
RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance
Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Nassir Navab, Matthias Keicher
Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models
Dong Li, Jiandong Jin, Yuhao Zhang, Yanlin Zhong, Yaoyang Wu, Lan Chen, Xiao Wang, Bin Luo
The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding
Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Claudio Gennaro, Fabrizio Falchi
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, Cyrus Rashtchian