Long-Form Video
Long-form video understanding aims to develop computational methods for analyzing and interpreting videos that exceed typical short-clip lengths, addressing the challenges of processing extensive temporal information and extracting high-level semantic concepts. Current research focuses on improving efficiency and accuracy through techniques such as hierarchical memory mechanisms, multimodal fusion (combining visual, audio, and textual data), and the adaptation of large language models (LLMs) and vision-language models (VLMs) to tasks such as question answering, summarization, and temporal action localization. The field is crucial for applications that require comprehensive video analysis, including video search, content creation, and assistive technologies for visually impaired users.
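To make the idea of a hierarchical memory mechanism concrete, the sketch below compresses a long sequence of per-frame features into a small two-level memory: mean-pooled episode vectors, plus a handful of semantic vectors chosen greedily for novelty. This is a minimal illustration, not the method of any paper listed below; the function `build_hierarchical_memory` and its `window` and `n_semantic` parameters are hypothetical, and it assumes frame embeddings have already been produced by some vision encoder.

```python
# Illustrative sketch of a hierarchical memory for long videos.
# Assumes precomputed per-frame embeddings as a (T, D) array.
import numpy as np

def build_hierarchical_memory(frame_feats: np.ndarray,
                              window: int = 32,
                              n_semantic: int = 8) -> dict:
    """Compress T frame features into episode- and semantic-level memories.

    frame_feats: (T, D) array of per-frame embeddings (assumed precomputed).
    window:      frames per episode; each episode is mean-pooled to one vector.
    n_semantic:  number of global summary vectors kept at the top level.
    """
    T, D = frame_feats.shape
    # Episode level: mean-pool contiguous windows of frames.
    episodes = np.stack([
        frame_feats[s:s + window].mean(axis=0)
        for s in range(0, T, window)
    ])  # shape: (ceil(T / window), D)

    # Semantic level: greedily keep the episodes least similar to the
    # already-chosen set, so the top level avoids redundant summaries.
    normed = episodes / (np.linalg.norm(episodes, axis=1, keepdims=True) + 1e-8)
    chosen = [0]
    while len(chosen) < min(n_semantic, len(episodes)):
        sims = normed @ normed[chosen].T   # similarity to each chosen memory
        scores = sims.max(axis=1)          # closeness to the nearest one
        scores[chosen] = np.inf            # never re-pick an episode
        chosen.append(int(scores.argmin()))  # most novel episode next
    semantic = episodes[sorted(chosen)]

    return {"episodes": episodes, "semantic": semantic}

# Example: 10,000 frames of 512-d features compress to ~313 episode vectors
# and 8 semantic vectors, small enough to fit in an LLM/VLM context window.
memory = build_hierarchical_memory(np.random.randn(10_000, 512))
print(memory["episodes"].shape, memory["semantic"].shape)
```

The design point this illustrates is the one the paragraph above gestures at: rather than feeding thousands of frames to a model, a hierarchy trades raw temporal detail for a compact, multi-granularity summary that downstream question answering or summarization can attend over.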
Papers
HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu
Open-Vocabulary Action Localization with Iterative Visual Prompting
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi