Visual Captioning Model
Visual captioning models automatically generate textual descriptions of images or videos, bridging the gap between visual and linguistic understanding. Current research focuses on improving caption accuracy and on addressing limitations such as reliance on large labeled datasets (through zero-shot and weakly-supervised approaches) and ambiguous viewpoints (via embodied captioning in 3D environments). These advances leverage techniques such as prompt engineering, graph representation learning, and multimodal interaction between vision and language models to improve caption quality and reduce hallucinations, with impact on image retrieval, accessibility, and multimodal AI.
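To make the zero-shot setting mentioned above concrete, here is a minimal sketch of caption generation with a pretrained vision-language model, assuming the Hugging Face transformers library and the publicly released Salesforce/blip-image-captioning-base checkpoint; the image URL is a placeholder, and none of this is tied to a specific paper listed on this page.

```python
# Minimal zero-shot captioning sketch using a pretrained
# vision-language model (BLIP via Hugging Face transformers).
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

# Load an image (hypothetical URL); any RGB image works.
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Encode the image and decode a caption without task-specific fine-tuning.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

A text prefix can also be passed to the processor (e.g. `processor(images=image, text="a photo of", return_tensors="pt")`) to steer the decoder, a simple instance of the prompt engineering mentioned above.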