Object-Centric Video
Object-centric video analysis represents a video as a collection of individual objects and their interactions over time, rather than processing each scene holistically. Current research focuses on models, often built on transformer architectures and slot-based state space models, that learn these object-centric representations from supervised or unsupervised data, in some cases drawing on vision-language models and depth information. The approach aims to improve efficiency, interpretability, and transferability to downstream tasks such as action recognition, 3D pose estimation, and video segmentation, particularly in challenging scenarios involving occlusion or fast motion. These advances point toward more robust and efficient video understanding systems, with applications in robotics, augmented reality, and other fields.
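To make the idea of a slot-based representation concrete, here is a minimal sketch in plain Python. It is a toy illustration under stated assumptions, not the method of any particular paper: the function names (`slot_update`, `track_video`), the hand-written 2-D features, and the choice to carry slots from one frame to the next are all hypothetical, and the learned projections, recurrent updates, and MLPs that real slot-based models use are omitted. What it does show is the core competition step, where each slot (one per hypothesized object) competes with the others to explain each frame feature.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_update(slots, features):
    """One simplified round of slot-style attention (no learned weights).

    slots:    list of K vectors, one per hypothesized object
    features: list of N vectors, e.g. patch features from one frame
    Returns (new_slots, attn), where attn[n][k] is the share of
    feature n assigned to slot k.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over the slot axis: slots compete for each feature.
    attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
    new_slots = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        total = sum(weights) + 1e-8
        # Each slot becomes the weighted mean of the features it claimed.
        new_slots.append([
            sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)
        ])
    return new_slots, attn


def track_video(frames, init_slots, iters=3):
    """Refine slots on each frame, carrying them forward in time
    (an assumption made for this illustration)."""
    slots = init_slots
    per_frame = []
    for features in frames:
        for _ in range(iters):
            slots, _ = slot_update(slots, features)
        per_frame.append(slots)
    return per_frame


# Two toy "objects" whose features drift slightly between frames.
frames = [
    [[4.0, 0.0], [3.8, 0.2], [0.0, 4.0], [0.2, 3.8]],
    [[4.2, 0.1], [4.0, 0.3], [0.1, 4.2], [0.3, 4.0]],
]
slots_over_time = track_video(frames, init_slots=[[0.9, 0.1], [0.1, 0.9]])
for t, slots in enumerate(slots_over_time):
    print(t, [[round(v, 2) for v in s] for s in slots])
```

In this toy setup each slot settles on one of the two feature clusters and follows it across frames, which is the sense in which the video is described per object rather than per scene. Downstream tasks would then operate on the per-frame slot vectors instead of on raw pixels.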