Video Understanding Task
Video understanding research aims to enable computers to interpret the content and context of videos, encompassing tasks such as action recognition, video captioning, and video question answering. Current efforts focus on developing robust and efficient models, often built on large language models (LLMs) and multimodal architectures, including those based on transformers and graph neural networks, that process both visual and auditory information and handle long-term temporal dependencies. These advances support applications ranging from automated video indexing and summarization to more demanding settings such as autonomous driving and medical diagnosis, and they contribute to progress in computer vision and artificial intelligence more broadly.
Papers
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak Shah
System-status-aware Adaptive Network for Online Streaming Video Understanding
Lin Geng Foo, Jia Gong, Zhipeng Fan, Jun Liu