Visual Captioning Model

Visual captioning models automatically generate textual descriptions of images or videos, bridging visual and linguistic understanding. Current research focuses on improving caption accuracy and on addressing limitations such as reliance on large labeled datasets (through zero-shot and weakly supervised approaches) and ambiguous viewpoints (via embodied captioning in 3D environments). These advances leverage techniques such as prompt engineering, graph representation learning, and multimodal interaction between vision and language models to enhance caption quality and reduce hallucinations, with applications in image retrieval, accessibility, and multimodal AI.
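
As a concrete illustration of how such models are used, the sketch below generates a caption with a pretrained vision-language model and then steers it with a text prefix, a simple form of the prompt engineering mentioned above. It assumes the Hugging Face transformers library and the public BLIP captioning checkpoint; the image URL is a placeholder, and any comparable captioner could be substituted.

    # Minimal captioning sketch, assuming the `transformers` library and the
    # public BLIP checkpoint; the image URL below is a placeholder.
    from PIL import Image
    import requests
    from transformers import BlipProcessor, BlipForConditionalGeneration

    checkpoint = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    # Load an example image (placeholder URL).
    image = Image.open(
        requests.get("https://example.com/cat.jpg", stream=True).raw
    ).convert("RGB")

    # Unconditional captioning: the model describes the image from scratch.
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(out[0], skip_special_tokens=True))

    # Conditional captioning: a text prefix steers the description,
    # a simple form of prompt engineering.
    inputs = processor(images=image, text="a photograph of", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    print(processor.decode(out[0], skip_special_tokens=True))

Because the model is pretrained on large-scale image-text pairs, no task-specific labeled data is needed at inference time, which is the appeal of the zero-shot approaches noted above.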

Papers