Video Understanding Task
Video understanding research aims to enable computers to interpret the content and context of videos, encompassing tasks such as action recognition, video captioning, and video question answering. Current efforts focus on developing robust and efficient models, often leveraging large language models (LLMs) and multimodal architectures, including transformers and graph neural networks, to process both visual and auditory information and to handle long-term temporal dependencies. These advances are crucial for applications ranging from automated video indexing and summarization to more complex tasks such as autonomous driving and medical diagnosis, and they continue to drive progress in both computer vision and artificial intelligence more broadly.
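One concrete aspect of handling long-term temporal dependencies is that models rarely process every frame of a long video; a common preprocessing step is to subsample a fixed number of frames spread evenly across the clip. The helper below is a minimal illustrative sketch of such uniform temporal sampling — the function name and the segment-midpoint strategy are assumptions for illustration, not taken from the papers listed here.

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread uniformly over a
    video of `num_frames` frames (illustrative sketch)."""
    if num_samples >= num_frames:
        # Short clip: keep every frame.
        return list(range(num_frames))
    # Split the video into `num_samples` equal segments and take
    # the midpoint frame of each, so samples cover the whole clip.
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]

# e.g. reduce a 300-frame clip to 8 frames before feeding a multimodal model
print(sample_frame_indices(300, 8))
```

The sampled frames would then be encoded (e.g. by a vision backbone) and passed, together with any audio or text, to the downstream model; denser or content-aware sampling is also common when fine-grained motion matters.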
Papers
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
Zixin Zhu, Xuelu Feng, Dongdong Chen, Junsong Yuan, Chunming Qiao, Gang Hua
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li