Dense Video Captioning

Dense video captioning aims to automatically generate detailed, temporally localized descriptions of events within untrimmed videos. Current research emphasizes improving the accuracy and efficiency of caption generation, with particular attention to online (real-time) captioning and to adapting large language models (LLMs) and pre-trained vision-language models to video data at low cost. The field advances video understanding, has applications in accessibility, video summarization, and automated content analysis, and drives progress in both computer vision and natural language processing.

Papers