Video Benchmark
Video benchmarks are standardized datasets and evaluation protocols used to assess video understanding models, driving progress in areas such as action recognition, video question answering, and object tracking. Current research focuses on building more comprehensive benchmarks that address the limitations of existing datasets, such as long videos, continuous perception, and diverse modalities (e.g., visible and thermal). This work includes novel model architectures, such as those incorporating contrastive learning, transformer-based approaches, and memory networks, to improve accuracy and efficiency. The resulting advances in video understanding have significant implications for applications including autonomous driving, video surveillance, and assistive technologies for the visually impaired.
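A core piece of most video benchmark protocols is a fixed scoring rule applied to model outputs. As a minimal illustration (not any specific benchmark's official evaluation code), the sketch below computes top-k accuracy for action recognition: a clip counts as correct if its ground-truth class is among the model's k highest-scoring classes. The function name and toy scores are hypothetical.

```python
from typing import Sequence

def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 1) -> float:
    """Fraction of clips whose true label is among the k highest-scoring classes."""
    correct = 0
    for clip_scores, label in zip(scores, labels):
        # Indices of the k classes with the highest predicted scores for this clip.
        top_k = sorted(range(len(clip_scores)),
                       key=lambda i: clip_scores[i],
                       reverse=True)[:k]
        correct += label in top_k
    return correct / len(labels)

# Toy example: 3 clips, 4 action classes.
scores = [
    [0.1, 0.7, 0.1, 0.1],  # predicted class 1
    [0.6, 0.2, 0.1, 0.1],  # predicted class 0
    [0.2, 0.2, 0.5, 0.1],  # predicted class 2
]
labels = [1, 0, 3]  # the last clip is misclassified at top-1
print(top_k_accuracy(scores, labels, k=1))
```

Real benchmarks layer additional conventions on top of this (multi-clip and multi-crop test-time aggregation, per-class averaging for imbalanced label sets), but the reported headline number is usually a variant of this accuracy.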
Papers
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis (EPFL)

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Nina Shvetsova, Arsha Nagrani, Bernt Schiele, Hilde Kuehne, Christian Rupprecht (Goethe University Frankfurt; Tuebingen AI Center/University of Tuebingen; MPI for Informatics; University of Oxford; MIT-IBM Watson AI Lab)