Multimodal Hallucination

Multimodal hallucination refers to the generation of inaccurate or fabricated content by large vision-language models (LVLMs), which combine visual and textual inputs: the generated text describes objects, attributes, or relations that are not grounded in the image. Current research focuses on understanding the underlying causes of these hallucinations, developing methods to detect them (often using novel metrics and datasets), and mitigating their occurrence through techniques such as hierarchical feedback learning, data filtering, and self-supervised revision mechanisms. This work is crucial for improving the reliability and trustworthiness of LVLMs, with impact on applications ranging from medical diagnosis to question-answering systems, where factual accuracy is paramount.
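
To make the detection side concrete, the sketch below computes a CHAIR-style object-hallucination score (in the spirit of Rohrbach et al., "Object Hallucination in Image Captioning"): the fraction of objects mentioned in a generated caption that do not appear among the image's annotated objects. It is a minimal illustration only; the function names, the toy data, and the assumption that objects have already been extracted and normalized are all simplifications, not any specific paper's implementation.

```python
from typing import Iterable, List, Sequence, Set


def chair_instance(mentioned_objects: Iterable[str], image_objects: Set[str]) -> float:
    """Fraction of mentioned objects absent from the image (CHAIR_i-style)."""
    mentioned = list(mentioned_objects)
    if not mentioned:
        return 0.0
    hallucinated = [obj for obj in mentioned if obj not in image_objects]
    return len(hallucinated) / len(mentioned)


def chair_sentence(per_caption_objects: Sequence[List[str]],
                   per_caption_ground_truth: Sequence[Set[str]]) -> float:
    """Fraction of captions containing at least one hallucinated object (CHAIR_s-style)."""
    flags = [
        any(obj not in gt for obj in objs)
        for objs, gt in zip(per_caption_objects, per_caption_ground_truth)
    ]
    return sum(flags) / len(flags) if flags else 0.0


# Toy example: objects parsed from one generated caption vs. objects annotated in the image.
caption_objects = ["dog", "frisbee", "bench"]          # "bench" is not actually in the image
image_objects = {"dog", "frisbee", "grass", "person"}
print(chair_instance(caption_objects, image_objects))  # ~0.33: one of three mentions is hallucinated
```

In practice, benchmarks of this kind rely on a synonym map and an object-detection or segmentation annotation set (e.g., COCO) to decide what counts as "in the image"; the set-membership test above stands in for that matching step.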

Papers