Large Vision-Language Models
Large Vision-Language Models (LVLMs) integrate computer vision and natural language processing so that machines can understand and reason about images and text jointly. Current research focuses on improving the accuracy, efficiency, and robustness of LVLMs, in particular mitigating hallucinations (generated content that is not grounded in the input) and strengthening multi-level visual perception and reasoning, including quantitative spatial reasoning and mechanical understanding. These advances matter for applications such as medical image analysis, robotics, and autonomous driving, where they enable more reliable and insightful multimodal data processing.
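As a concrete illustration of the image-plus-text interface these models expose, the short sketch below queries an off-the-shelf open LVLM about an image. It is only a minimal example, not drawn from any of the papers listed here: it assumes the Hugging Face transformers and Pillow packages, the publicly released llava-hf/llava-1.5-7b-hf checkpoint, and a placeholder image URL.

# Minimal sketch: ask an open LVLM (LLaVA-1.5) a question about an image.
# Assumes `transformers`, `torch`, `Pillow`, and `requests` are installed and
# that enough memory is available to load the 7B checkpoint.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Placeholder image URL; replace with a real image of interest.
image = Image.open(requests.get("https://example.com/scene.jpg", stream=True).raw)

# LLaVA-1.5 conversation format: the <image> token marks where visual features are inserted.
prompt = "USER: <image>\nHow many people are in the picture, and what are they doing? ASSISTANT:"

# Jointly encode the image and the text prompt, then generate an answer.
inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])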
Papers
HumanVLM: Foundation for Human-Scene Vision-Language Model
Dawei Dai, Xu Long, Li Yutang, Zhang Yuanhui, Shuyin Xia
Membership Inference Attacks against Large Vision-Language Models
Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, Volkan Cevher
DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark
Haodong Li, Haicheng Qu, Xiaofeng Zhang
V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization
Yuxi Xie, Guanzhen Li, Xiao Xu, Min-Yen Kan
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
Han Bao, Yue Huang, Yanbo Wang, Jiayi Ye, Xiangqi Wang, Xiuyin Chen, Mohamed Elhoseiny, Xiangliang Zhang
Zero-Shot Action Recognition in Surveillance Videos
Joao Pereira, Vasco Lopes, David Semedo, Joao Neves
Improving Generalization in Visual Reasoning via Self-Ensemble
Tien-Huy Nguyen, Quang-Khai Tran, Anh-Tuan Quang-Hoang
Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
Yucheng Zhou, Zhi Rao, Jun Wan, Jianbing Shen
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
Fengbin Zhu, Ziyang Liu, Xiang Yao Ng, Haohui Wu, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, Tat Seng Chua
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang