Video Understanding Task
Video understanding research aims to enable computers to interpret the content and context of videos, encompassing tasks like action recognition, video captioning, and question answering. Current efforts focus on developing robust and efficient models, often leveraging large language models (LLMs) and multimodal architectures, including transformers and graph neural networks, to process both visual and auditory information and handle long-term temporal dependencies. These advancements are crucial for applications ranging from automated video indexing and summarization to more complex tasks such as autonomous driving and medical diagnosis, driving significant progress in both computer vision and artificial intelligence.
Papers
Foundation Models for Video Understanding: A Survey
Neelu Madan, Andreas Moegelmose, Rajat Modi, Yogesh S. Rawat, Thomas B. Moeslund
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan