Video Understanding
Video understanding aims to enable computers to comprehend the content and context of videos, mirroring human capabilities in interpreting visual and auditory information over time. Current research focuses heavily on improving the temporal reasoning abilities of multimodal large language models (MLLMs) and on addressing limitations in handling long videos, often employing architectures that integrate visual encoders with LLMs or that leverage novel spatiotemporal modeling techniques. This field is crucial for advancing applications in healthcare (e.g., patient education), autonomous driving, and multimedia analysis, driving the development of more robust and efficient video processing methods.
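To make the "visual encoder + LLM" pattern mentioned above concrete, here is a minimal sketch, not taken from any of the papers below: frames are sampled from a clip, encoded into per-frame features, projected into the LLM's embedding dimension, and prepended to the text prompt embeddings as a visual prefix. All module names (FrameEncoder, VideoToLLMAdapter), dimensions, and the uniform frame-sampling strategy are illustrative assumptions.

```python
# Illustrative sketch of a visual-encoder-plus-LLM video pipeline.
# Assumptions: uniform frame sampling, a stand-in frame encoder, and a
# linear projection into a hypothetical 4096-dim LLM embedding space.
import torch
import torch.nn as nn


def sample_frames(video: torch.Tensor, num_frames: int = 8) -> torch.Tensor:
    """Uniformly sample `num_frames` frames from a (T, C, H, W) video tensor."""
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, num_frames).long()
    return video[idx]


class FrameEncoder(nn.Module):
    """Stand-in for a pretrained image encoder (e.g., a ViT) that maps
    each frame to a single feature vector."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # collapse spatial dims
        self.proj = nn.LazyLinear(feat_dim)      # per-frame feature vector

    def forward(self, frames: torch.Tensor) -> torch.Tensor:  # (N, C, H, W)
        pooled = self.pool(frames).flatten(1)    # (N, C)
        return self.proj(pooled)                 # (N, feat_dim)


class VideoToLLMAdapter(nn.Module):
    """Projects frame features into the LLM embedding space so they can be
    concatenated with text token embeddings as a soft visual prefix."""

    def __init__(self, feat_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.encoder = FrameEncoder(feat_dim)
        self.to_llm = nn.Linear(feat_dim, llm_dim)

    def forward(self, video: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        frames = sample_frames(video)            # (N, C, H, W)
        feats = self.encoder(frames)             # (N, feat_dim)
        visual_tokens = self.to_llm(feats)       # (N, llm_dim)
        # Prepend visual tokens to the text prompt embeddings (1, L, llm_dim).
        return torch.cat([visual_tokens.unsqueeze(0), text_emb], dim=1)


if __name__ == "__main__":
    video = torch.randn(64, 3, 224, 224)         # dummy 64-frame clip
    text_emb = torch.randn(1, 16, 4096)          # dummy prompt embeddings
    adapter = VideoToLLMAdapter()
    fused = adapter(video, text_emb)
    print(fused.shape)                           # torch.Size([1, 24, 4096])
```

In practice, the frame encoder is a frozen pretrained vision backbone and the LLM consumes the fused sequence directly; the fixed-budget frame sampling shown here is also one simple way the long-video limitation arises, since distant temporal context is discarded.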
Papers
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs
Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, Volker Tresp
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai