Video Understanding
Video understanding aims to enable computers to comprehend the content and context of videos, mirroring the human ability to interpret visual and auditory information over time. Current research focuses on improving the temporal reasoning abilities of large multimodal models (LMMs) and on addressing limitations in handling long videos, often employing architectures that couple visual encoders with large language models (LLMs) or that leverage novel spatiotemporal modeling techniques. This field is crucial for advancing applications in healthcare (e.g., patient education), autonomous driving, and multimedia analysis, driving the development of more robust and efficient video processing methods.
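To make the "visual encoder + LLM" pattern concrete, below is a minimal PyTorch sketch of how per-frame visual features might be projected into an LLM's embedding space and prepended to text tokens. It is an illustrative assumption, not the method of any paper listed here; the `VideoToLLMAdapter` name, the dimensions, and the two-layer MLP projector are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

class VideoToLLMAdapter(nn.Module):
    """Hypothetical adapter: projects per-frame visual features into the
    LLM's token-embedding space so video tokens can precede the text prompt."""
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, vision_dim), e.g. from a frozen
        # CLIP-style image encoder applied to uniformly sampled frames.
        return self.proj(frame_features)  # (batch, num_frames, llm_dim)

# Toy usage with stand-in tensors (no real encoder or LLM is loaded here).
adapter = VideoToLLMAdapter()
frame_feats = torch.randn(1, 8, 768)      # 8 sampled frames, encoded offline
video_tokens = adapter(frame_feats)       # pseudo-tokens for the LLM
text_embeds = torch.randn(1, 32, 4096)    # stand-in for the embedded prompt
llm_inputs = torch.cat([video_tokens, text_embeds], dim=1)
print(llm_inputs.shape)                   # torch.Size([1, 40, 4096])
```

The sketch also hints at why long videos are hard: each sampled frame consumes token budget in the LLM's context, so denser temporal sampling quickly becomes expensive, motivating the compression and spatiotemporal modeling techniques mentioned above.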
Papers
ContextDet: Temporal Action Detection with Adaptive Context Aggregation
Ning Wang, Yun Xiao, Xiaopeng Peng, Xiaojun Chang, Xuanhong Wang, Dingyi Fang
Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison
Shiyu Hu, Xuchen Li, Xuzhao Li, Jing Zhang, Yipei Wang, Xin Zhao, Kang Hao Cheong
AirLetters: An Open Video Dataset of Characters Drawn in the Air
Rishit Dagli, Guillaume Berger, Joanna Materzynska, Ingo Bax, Roland Memisevic
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
Jianrui Zhang, Mu Cai, Yong Jae Lee
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM
Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang