Large Vision Language Model

Large Vision-Language Models (LVLMs) combine computer vision and natural language processing so that a single model can interpret and reason over images and text together. Current research aims to improve their accuracy, efficiency, and robustness, in particular by reducing hallucinations (generated content that is inaccurate or unsupported by the input) and by strengthening multi-level visual perception and reasoning, including quantitative spatial reasoning and mechanical understanding. These advances matter for applications such as medical image analysis, robotics, and autonomous driving, all of which depend on reliable multimodal data processing.

Papers

September 28, 2024

3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models
Hao Chen, Wei Zhao, Yingli Li, Tianyang Zhong, Yisong Wang, Youlan Shang, Lei Guo, Junwei Han, Tianming Liu, Jun Liu, Tuo Zhang
Large Vision Language Model Radiology Report Generation 3D Medical Image Medical Image Data 3D CT

September 25, 2024

September 24, 2024

September 23, 2024

September 22, 2024

Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization
Minyi Zhao, Jie Wang, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Shuigeng Zhou
Large Vision Language Model Mitigating Hallucination Hallucination Evaluation Adaptive Prompt Prompt Augmentation

September 21, 2024

SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
Jiashuo Sun, Jihai Zhang, Yucheng Zhou, Zhaochen Su, Xiaoye Qu, Yu Cheng
Natural Language Processing Retrieval Augmented Generation Large Vision Language Model Multi Modal

September 20, 2024

FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs
Bowen Yan, Zhengsong Zhang, Liqiang Jing, Eftekhar Hossain, Xinya Du
Vision Language Model Large Vision Language Model Scene Graph Hallucination Evaluation Language Model Hallucination

September 18, 2024

September 17, 2024

Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA
Jian Lan, Diego Frassinelli, Barbara Plank
Vision Language Model Large Vision Language Model Visual Question Answering High Uncertainty Anticipation Human Evaluation User Response Model Prediction Multi Annotator Human Disagreement

September 15, 2024

September 11, 2024

Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain, Ahmed Imteaj
Adversarial Attack Vision Language Model Large Vision Language Model Jailbreak Attack Adversarial Manipulation CLIP Vision Encoder Gradient Based Adversarial Robust Encoders

September 8, 2024

PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe Questions
Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Yu Wang
Adversarial Attack Adversarial Example Large Vision Language Model Linear Probing Attention Pattern

September 5, 2024

Have Large Vision-Language Models Mastered Art History?
Ombretta Strafforello, Derya Soydaner, Michiel Willems, Anne-Sofie Maerten, Stefanie De Winter
Image Classification Large Vision Language Model Art Specific Information

Papers

3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models

Attention Prompting on Image for Large Vision-Language Models

Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

ReLEP: A Novel Framework for Real-world Long-horizon Embodied Planning

VLMine: Long-Tail Data Mining with Vision Language Models

Behavioral Bias of Vision-Language Models: A Behavioral Finance View

A-VL: Adaptive Attention for Large Vision-Language Models

Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization

SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information

FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Navigation with VLM framework: Go to Any Language

Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA

FSL-LVLM: Friction-Aware Safety Locomotion using Large Vision Language Model in Wheeled Robots

Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models

Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks

PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe Questions

Have Large Vision-Language Models Mastered Art History?