Multimodal Hierarchical Multimedia Summarization

Multimodal hierarchical multimedia summarization aims to produce concise, informative summaries of multimedia content, such as videos, that integrate visual and textual information. Current research focuses on models that combine the two modalities effectively, often using large language models together with techniques such as optimal transport to align the modalities and generate coherent keyframe-caption pairs or video-text summaries. The field matters because of the growing need for efficient ways to navigate and understand large volumes of multimedia data, with applications ranging from improved video search to automated content generation.

Papers