Paper ID: 2403.14401

Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination

Dingchen Yang, Bowen Cao, Guang Chen, Changjun Jiang

Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs oblivious to the accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may equally advocate both accurate and erroneous content. To address this issue, we propose Pensieve, a training-free method that leverages the analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics, to mitigate hallucination. Specifically, Pensieve enables MLLMs to retrospect relevant images as references and compare their visual content with the test image via confidence score subtraction. Moreover, our paradigm balances the effects of addressing errors from both the visual and textual branches by adaptively scaling the subtracted scores. Experiments on Whoops, LLaVA Bench, POPE, and MME demonstrate the efficacy of Pensieve in mitigating visual hallucination, surpassing other advanced decoding strategies. Pensieve also aids MLLMs in identifying visual details and enhance the specificity of generated image descriptions.

Submitted: Mar 21, 2024