Large Vision Language Model
Large Vision-Language Models (LVLMs) integrate computer vision and natural language processing to enable machines to understand and reason about images and text simultaneously. Current research focuses on improving LVLMs' accuracy, efficiency, and robustness, particularly addressing issues like hallucinations (generating inaccurate information), and enhancing their ability to perform multi-level visual perception and reasoning tasks, including quantitative spatial reasoning and mechanical understanding. These advancements are significant for various applications, including medical image analysis, robotics, and autonomous driving, by enabling more reliable and insightful multimodal data processing.
Papers
Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond
Tianxin Wei, Bowen Jin, Ruirui Li, Hansi Zeng, Zhengyang Wang, Jianhui Sun, Qingyu Yin, Hanqing Lu, Suhang Wang, Jingrui He, Xianfeng Tang
Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning
Dongmin Park, Zhaofang Qian, Guangxing Han, Ser-Nam Lim
XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization
Yequan Bie, Luyang Luo, Zhixuan Chen, Hao Chen
AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions
Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
AIGCs Confuse AI Too: Investigating and Explaining Synthetic Image-induced Hallucinations in Large Vision-Language Models
Yifei Gao, Jiaqi Wang, Zhiyu Lin, Jitao Sang
Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification
Long Lan, Fengxiang Wang, Shuyan Li, Xiangtao Zheng, Zengmao Wang, Xinwang Liu
TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection
Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Yezi Liu, Fei Wen, Alvaro Velasquez, Hugo Latapie, Mohsen Imani
VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark
Han Huang, Haitian Zhong, Tao Yu, Qiang Liu, Shu Wu, Liang Wang, Tieniu Tan
Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation
Xinyao Li, Yuke Li, Zhekai Du, Fengling Li, Ke Lu, Jingjing Li
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
Can LLMs' Tuning Methods Work in Medical Multimodal Domain?
Jiawei Chen, Yue Jiang, Dingkang Yang, Mingcheng Li, Jinjie Wei, Ziyun Qian, Lihua Zhang
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, Jiawei Zhou
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, Qi Liu