Video Paragraph Captioning

Video paragraph captioning (VPC) aims to automatically generate multi-sentence descriptions of long, untrimmed videos, capturing the narrative flow of events. Current research emphasizes models that remain robust when data from some modalities (e.g., video, speech, event boundaries) is missing or incomplete, often employing transformer-based architectures and contrastive learning to improve coherence and accuracy. The field advances multimodal understanding and has applications in video summarization, accessibility for visually impaired individuals, and human-computer interaction.

Papers