Multimodal Video Understanding
Multimodal video understanding aims to enable machines to analyze videos comprehensively by integrating information from multiple modalities, such as visual, audio, and textual data. Current research focuses on robust models, often built on large language models and transformer architectures, that can handle long videos, incomplete data, and diverse tasks such as question answering, anomaly detection, and captioning. The field underpins applications in healthcare (e.g., automated diagnosis), content analysis (e.g., video summarization), and surveillance (e.g., crowd monitoring), and is driving progress in both model design and benchmark development.
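The fusion step described above is commonly implemented by projecting each modality into a shared token space and letting a transformer attend jointly across all streams. Below is a minimal PyTorch sketch of that idea; the class name, feature dimensions, and mean-pooling readout are illustrative assumptions, not the architecture of any specific paper listed here.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuse per-modality token sequences with a shared transformer encoder.

    A hypothetical sketch: feature dims (512/128/300) stand in for the
    outputs of whatever visual, audio, and text encoders precede fusion.
    """
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project each modality's features into a shared embedding space.
        self.visual_proj = nn.Linear(512, d_model)  # e.g. per-frame features
        self.audio_proj = nn.Linear(128, d_model)   # e.g. spectrogram features
        self.text_proj = nn.Linear(300, d_model)    # e.g. word embeddings
        # Learned modality-type embeddings let the encoder tell streams apart.
        self.type_emb = nn.Embedding(3, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, visual, audio, text):
        # Each input: (batch, seq_len, feat_dim) for its modality.
        tokens = torch.cat([
            self.visual_proj(visual) + self.type_emb.weight[0],
            self.audio_proj(audio) + self.type_emb.weight[1],
            self.text_proj(text) + self.type_emb.weight[2],
        ], dim=1)
        fused = self.encoder(tokens)  # joint attention across all modalities
        return fused.mean(dim=1)      # pooled clip-level representation

# Toy usage: 8 frames, 20 audio steps, 12 text tokens.
model = MultimodalFusion()
clip = model(torch.randn(1, 8, 512),
             torch.randn(1, 20, 128),
             torch.randn(1, 12, 300))
print(clip.shape)  # torch.Size([1, 256])
```

The pooled representation would then feed a task head (e.g., an answer decoder for question answering or a classifier for anomaly detection); concatenating modality tokens into one attention context is only one of several fusion strategies, chosen here for brevity.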
Papers
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin