Long-Form Video
Long-form video understanding aims to develop computational methods for analyzing and interpreting videos that exceed typical short-clip lengths, often spanning minutes to hours, which raises challenges in processing extensive temporal information and extracting high-level semantic concepts. Current research focuses on improving efficiency and accuracy through techniques such as hierarchical memory mechanisms, multimodal fusion (combining visual, audio, and textual data), and the adaptation of large language models (LLMs) and vision-language models (VLMs) for tasks including question answering, summarization, and temporal action localization. This field is crucial for advancing applications that require comprehensive video analysis, including video search, content creation, and assistive technologies for the visually impaired.
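The hierarchical memory mechanisms mentioned above typically compress streamed frame features into a compact long-term store that can later be queried, so a model never has to attend over every frame of a long video at once. The sketch below is a minimal, hypothetical illustration of that idea, assuming precomputed per-frame features, mean-pooled segment memories, and cosine-similarity retrieval; the class name, parameters, and pooling choice are placeholders for demonstration, not any specific published method.

```python
# Illustrative sketch of a hierarchical memory for long-form video features.
# Per-frame features are assumed to be precomputed (e.g., by a visual encoder);
# random vectors stand in for them here purely for demonstration.
import numpy as np


class HierarchicalVideoMemory:
    def __init__(self, segment_len=16, dim=512):
        self.segment_len = segment_len   # frames per short-term segment
        self.dim = dim
        self.short_term = []             # raw frame features for the current segment
        self.long_term = []              # one consolidated vector per past segment

    def add_frame(self, feat):
        """Buffer a frame feature; consolidate when the segment is full."""
        self.short_term.append(feat)
        if len(self.short_term) == self.segment_len:
            # Consolidation: mean-pool the segment into a single memory slot.
            # Real systems often use learned compression or attention instead.
            self.long_term.append(np.mean(self.short_term, axis=0))
            self.short_term = []

    def retrieve(self, query, top_k=3):
        """Return indices of the top-k segment memories most similar to a query."""
        memory = np.stack(self.long_term)                  # (num_segments, dim)
        sims = memory @ query / (
            np.linalg.norm(memory, axis=1) * np.linalg.norm(query) + 1e-8
        )
        return np.argsort(-sims)[:top_k], sims


# Usage: stream 10 minutes of video at 1 feature per second (600 frames).
rng = np.random.default_rng(0)
mem = HierarchicalVideoMemory(segment_len=16, dim=512)
for _ in range(600):
    mem.add_frame(rng.normal(size=512).astype(np.float32))

query = rng.normal(size=512).astype(np.float32)
top_segments, scores = mem.retrieve(query, top_k=3)
print("Most relevant segments:", top_segments)
```

In a full system, the retrieved segment memories would be passed to a downstream module (e.g., an LLM or VLM head) for question answering or summarization, keeping the context length bounded regardless of video duration.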