Video Summarization
Video summarization aims to automatically condense lengthy videos into concise, informative summaries, either shorter highlight videos or textual descriptions, that preserve key information and reflect user relevance. Current research emphasizes multimodal approaches that integrate visual and audio features with large language models (LLMs) and transformer-based architectures, often employing attention mechanisms, graph representations, and efficient token mixing to improve both accuracy and computational efficiency. The field matters for managing the ever-increasing volume of video data, with applications ranging from social media and education to surveillance and personalized content delivery. Progress on more efficient and accurate summarization methods is in turn driving advances in both computer vision and natural language processing.
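To make the attention-based selection idea concrete, here is a minimal sketch of keyframe selection via dot-product self-attention over frame features. This is an illustrative toy, not any specific paper's method: the function names (`frame_importance`, `summarize`) and the scoring rule (rank frames by how much attention they receive from all other frames) are assumptions for the example, and real systems would use learned projections and pretrained visual encoders.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def frame_importance(frames):
    """Score each frame by the average attention it receives from all
    frames under scaled dot-product self-attention (single head, no
    learned projections). Scores sum to 1 across the video."""
    n = len(frames)
    scale = math.sqrt(len(frames[0]))
    received = [0.0] * n
    for i in range(n):
        # Attention weights from frame i to every frame j.
        weights = softmax([dot(frames[i], frames[j]) / scale for j in range(n)])
        for j in range(n):
            received[j] += weights[j] / n
    return received

def summarize(frames, k):
    """Return indices of the k most-attended frames, in temporal order."""
    scores = frame_importance(frames)
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

For example, given four 2-D frame feature vectors, `summarize(frames, 2)` picks the two frames that other frames attend to most, returned in their original temporal order so the summary plays back coherently.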
Papers
Video-CSR: Complex Video Digest Creation for Visual-Language Models
Tingkai Liu, Yunzhe Tao, Haogeng Liu, Qihang Fan, Ding Zhou, Huaibo Huang, Ran He, Hongxia Yang
Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling
Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang