Long-Form Video Understanding

Long-form video understanding aims to enable machines to comprehend and reason about videos that far exceed typical short-clip lengths, addressing the challenges posed by extended temporal context and large volumes of visual data. Current research focuses on adapting large language models (LLMs) and multimodal large language models (MLLMs) to long videos, typically through hierarchical representations, memory mechanisms (e.g., memory banks or recurrent memory bridges), and adaptive token selection that keeps computational costs manageable. Progress in this area underpins applications such as video summarization, event detection, and long-video question answering, advancing both computer vision and natural language processing.
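
To make the adaptive token selection idea concrete, the sketch below is a minimal, hypothetical illustration (not the method of any particular paper): it scores per-frame features against a query embedding, keeps only the most relevant frames within a fixed token budget, and mean-pools the discarded frames into a single summary token so some coarse context is retained. The function names and shapes are assumptions for illustration.

```python
import numpy as np

def select_tokens(frame_tokens, query, budget):
    """Keep the `budget` frame tokens most similar to the query embedding.

    frame_tokens: (num_frames, dim) array of per-frame visual features
    query:        (dim,) query/text embedding in the same space
    budget:       number of frame tokens to pass to the language model
    """
    # Cosine similarity between each frame token and the query.
    sims = frame_tokens @ query / (
        np.linalg.norm(frame_tokens, axis=1) * np.linalg.norm(query) + 1e-8
    )
    keep = np.argsort(sims)[-budget:]   # indices of the most query-relevant frames
    keep = np.sort(keep)                # restore temporal order
    drop = np.setdiff1d(np.arange(len(frame_tokens)), keep)

    selected = frame_tokens[keep]
    # Summarize discarded frames with one mean-pooled token so coarse
    # context about the rest of the video is not lost entirely.
    if len(drop):
        summary = frame_tokens[drop].mean(axis=0, keepdims=True)
    else:
        summary = np.empty((0, frame_tokens.shape[1]), dtype=frame_tokens.dtype)
    return np.concatenate([selected, summary], axis=0)

# Toy usage: 1,000 frames of 256-d features reduced to 65 tokens (64 kept + 1 summary).
tokens = np.random.randn(1000, 256).astype(np.float32)
query = np.random.randn(256).astype(np.float32)
reduced = select_tokens(tokens, query, budget=64)
print(reduced.shape)  # (65, 256)
```

Real systems differ in how the relevance scores are computed (e.g., learned attention rather than cosine similarity) and in how discarded content is compressed (hierarchical pooling, memory banks), but the budget-constrained selection step shown here is the common core.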

Papers