Visual Captioning

Visual captioning aims to automatically generate natural language descriptions of images or videos, bridging the gap between visual and textual modalities. Current research emphasizes improving caption detail, coherence across multiple images, and handling diverse visual inputs like 3D scenes and visualizations, often employing transformer-based architectures and leveraging large language models for enhanced contextual understanding and fact-checking. These advancements are driving progress in various applications, including image retrieval, visual question answering, and assistive technologies for visually impaired individuals. The development of high-quality datasets and robust evaluation metrics is also a key focus, enabling more reliable benchmarking and comparison of different approaches.
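The core inference loop shared by most transformer-based captioners is autoregressive decoding conditioned on encoded image features. The sketch below is a toy illustration of that loop, not any real model: `score_next_token` is a hypothetical stub standing in for a transformer decoder's logits, and the vocabulary and fixed caption are invented for the example.

```python
# Toy sketch of greedy caption decoding. In a real captioner, `score_next_token`
# would be a transformer decoder conditioned on features from a vision encoder;
# here it is a deterministic stub so the loop structure is runnable on its own.

VOCAB = ["<bos>", "<eos>", "a", "dog", "runs", "on", "grass"]

def score_next_token(image_features, prefix):
    """Stub scorer: returns fake 'logits' that walk through a fixed caption."""
    target = ["<bos>", "a", "dog", "runs", "on", "grass", "<eos>"]
    idx = min(len(prefix), len(target) - 1)
    return [1.0 if tok == target[idx] else 0.0 for tok in VOCAB]

def greedy_caption(image_features, max_len=10):
    """Greedy autoregressive decoding: pick the top-scoring token each step."""
    prefix = ["<bos>"]
    while len(prefix) < max_len:
        scores = score_next_token(image_features, prefix)
        next_tok = VOCAB[scores.index(max(scores))]
        if next_tok == "<eos>":
            break
        prefix.append(next_tok)
    return " ".join(prefix[1:])  # drop the <bos> marker

print(greedy_caption(image_features=None))
```

In practice the greedy step is often replaced by beam search or sampling, and the image features come from a pretrained vision encoder (e.g. a ViT backbone), but the decode-until-`<eos>` structure is the same.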

Papers