Video Understanding Benchmarks
Video understanding benchmarks are designed to evaluate the ability of artificial intelligence models to comprehend and reason about video content, with a particular focus on the challenges posed by long videos. Current research emphasizes models that handle long temporal dependencies and diverse video types, often by pairing multimodal large language models (MLLMs) with advanced visual encoders and memory mechanisms (a toy sketch of such a memory mechanism follows below), or by exploring alternative architectures such as state space models for improved efficiency. These benchmarks are crucial for advancing the field: they provide standardized evaluations, expose the limitations of existing models, and drive the development of more robust and efficient video understanding systems, with applications ranging from video summarization to real-time video analysis.
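To make the memory-mechanism idea concrete, here is a minimal, self-contained sketch of a bounded frame memory for a streaming video pipeline. It is illustrative only, not the method of any paper listed here: the names (EMBED_DIM, MEMORY_SLOTS, encode_frame, FrameMemory) are hypothetical, encode_frame stands in for a real visual encoder, and the consolidation rule (merge the two most similar embeddings) is one simple way to keep memory constant as the stream grows.

```python
# Toy sketch: bounded memory for long video streams (all names hypothetical).
import math
import random

EMBED_DIM = 8      # toy embedding size; real visual encoders use hundreds+
MEMORY_SLOTS = 4   # hard cap on stored frame features

def encode_frame(frame_id: int) -> list[float]:
    """Stand-in for a visual encoder (e.g. a ViT); returns a toy embedding."""
    rng = random.Random(frame_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-9)

class FrameMemory:
    """When full, merge the two most similar entries (weighted average) so
    the memory footprint stays constant however long the stream runs."""
    def __init__(self, slots: int) -> None:
        self.slots = slots
        self.entries: list[tuple[list[float], int]] = []  # (embedding, weight)

    def add(self, emb: list[float]) -> None:
        self.entries.append((emb, 1))
        if len(self.entries) > self.slots:
            self._consolidate()

    def _consolidate(self) -> None:
        # Find the closest pair of stored embeddings.
        best, pair = -2.0, (0, 1)
        for i in range(len(self.entries)):
            for j in range(i + 1, len(self.entries)):
                sim = cosine(self.entries[i][0], self.entries[j][0])
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        (ea, wa), (eb, wb) = self.entries[i], self.entries[j]
        merged = [(x * wa + y * wb) / (wa + wb) for x, y in zip(ea, eb)]
        del self.entries[j]             # remove j first (j > i)
        self.entries[i] = (merged, wa + wb)

memory = FrameMemory(MEMORY_SLOTS)
for frame_id in range(20):              # simulate a 20-frame stream
    memory.add(encode_frame(frame_id))
print(f"{len(memory.entries)} memory slots summarize 20 frames")
```

In a real system the surviving slots would be projected into the language model's token space and prepended to the user's question at answer time; the design choice being illustrated is that consolidation trades fine-grained temporal detail for constant memory, which is exactly the trade-off long-video benchmarks probe.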
Papers
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang