Video Understanding Task
Video understanding research aims to enable computers to interpret the content and context of videos, encompassing tasks such as action recognition, video captioning, and video question answering. Current efforts focus on developing robust and efficient models, often leveraging large language models (LLMs) and multimodal architectures, including transformers and graph neural networks, to process both visual and auditory information and to handle long-term temporal dependencies. These advances are crucial for applications ranging from automated video indexing and summarization to more complex tasks such as autonomous driving and medical diagnosis, and they continue to drive progress in both computer vision and artificial intelligence more broadly.
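One concrete aspect of handling long-term temporal dependencies is that models rarely process every frame of a long video; a common preprocessing step is to subsample a fixed number of frames spread evenly across the clip. The helper below is a minimal illustrative sketch of such uniform temporal sampling — the function name and the segment-midpoint strategy are assumptions for illustration, not taken from the papers listed here.

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread uniformly over a
    video of `num_frames` frames (illustrative sketch)."""
    if num_samples >= num_frames:
        # Short clip: keep every frame.
        return list(range(num_frames))
    # Split the video into `num_samples` equal segments and take
    # the midpoint frame of each, so samples cover the whole clip.
    segment = num_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]

# e.g. reduce a 300-frame clip to 8 frames before feeding a multimodal model
print(sample_frame_indices(300, 8))
```

The sampled frames would then be encoded (e.g. by a vision backbone) and passed, together with any audio or text, to the downstream model; denser or content-aware sampling is also common when fine-grained motion matters.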
Papers
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
Zixin Zhu, Xuelu Feng, Dongdong Chen, Junsong Yuan, Chunming Qiao, Gang Hua
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li