Caption Editing

Caption editing aims to improve the accuracy, fluency, and informativeness of image and video captions, primarily using large vision-language models (LVLMs) and diffusion mechanisms. Current research emphasizes mitigating hallucinations (details in a generated caption that are incorrect for the visual content), improving generalization across diverse datasets, and developing explainable editing methods that mimic human revisions through explicit edit operations. These advances improve the quality and reliability of multimodal data, with applications in image retrieval, visual question answering, and accessible multimedia content creation.
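
To make the idea of explicit edit operations concrete, the following is a minimal sketch, not a method from any particular paper. It assumes word-level operations and uses Python's standard difflib to derive them from a draft caption and a corrected one; the function names edit_operations and apply_operations, and the example captions, are hypothetical. A learned editor would predict such operations from the image rather than from a known reference caption.

```python
from difflib import SequenceMatcher


def edit_operations(source, target):
    """Return word-level (op, source_words, target_words) tuples that turn source into target."""
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of "equal", "replace", "delete", "insert"
        ops.append((tag, src[i1:i2], tgt[j1:j2]))
    return ops


def apply_operations(ops):
    """Rebuild the edited caption by applying the operations in order."""
    out = []
    for tag, src_words, tgt_words in ops:
        if tag == "equal":
            out.extend(src_words)      # keep the words unchanged
        elif tag in ("replace", "insert"):
            out.extend(tgt_words)      # emit the corrected or added words
        # "delete": emit nothing, the words are dropped
    return " ".join(out)


if __name__ == "__main__":
    draft = "a dog sits on a red sofa"      # hypothetical caption with wrong details
    corrected = "a cat sits on a sofa"      # hypothetical human revision
    for op in edit_operations(draft, corrected):
        print(op)
    print(apply_operations(edit_operations(draft, corrected)))
```

Because each operation names the words it keeps, replaces, deletes, or inserts, the sequence of operations can be inspected directly, which is the sense in which such edits are explainable.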

Papers