Paper ID: 2410.15728 • Published Oct 21, 2024
Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases
Cristian Meo, Akihiro Nakano, Mircea Lică, Aniket Didolkar, Masahiro Suzuki, Anirudh Goyal, Mengmi Zhang, Justin Dauwels...
Unsupervised object-centric learning from videos is a promising approach
towards learning compositional representations that can be applied to various
downstream tasks, such as prediction and reasoning. Recently, it was shown that
pretrained Vision Transformers (ViTs) can be useful for learning object-centric
representations on real-world video datasets. However, while these approaches
succeed at extracting objects from the scenes, the slot-based representations
fail to maintain temporal consistency across consecutive frames in a video,
i.e., the mapping of objects to slots changes across the video. To address this,
we introduce Conditional Autoregressive Slot Attention (CA-SA), a framework
that enhances the temporal consistency of extracted object-centric
representations in video-centric vision tasks. Leveraging an autoregressive
prior network to condition representations on previous timesteps and a novel
consistency loss function, CA-SA predicts future slot representations and
imposes consistency across frames. We present qualitative and quantitative
results showing that our proposed method outperforms the considered baselines
on downstream tasks such as video prediction and visual question answering.
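To make the described mechanism concrete, below is a minimal, hedged sketch of the two ingredients named in the abstract: an autoregressive prior that conditions slot representations on the previous timestep, and a consistency loss between the predicted and extracted slots of the next frame. The abstract does not specify the architecture or loss, so the GRU-based prior, the MSE objective, and all names (`AutoregressiveSlotPrior`, `consistency_loss`) are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoregressiveSlotPrior(nn.Module):
    """Hypothetical autoregressive prior: predicts slots at time t+1
    conditioned on slots at time t via a per-slot recurrent update.
    (Assumed architecture; the paper's prior network may differ.)"""
    def __init__(self, slot_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.cell = nn.GRUCell(slot_dim, hidden_dim)
        self.to_slot = nn.Linear(hidden_dim, slot_dim)

    def forward(self, slots_t, hidden):
        # slots_t: (batch * num_slots, slot_dim); hidden: (batch * num_slots, hidden_dim)
        hidden = self.cell(slots_t, hidden)
        pred_slots_next = self.to_slot(hidden)
        return pred_slots_next, hidden

def consistency_loss(pred_slots_next, slots_next):
    """Assumed consistency objective: penalize mismatch between predicted
    and extracted next-frame slots, encouraging stable object-to-slot bindings."""
    return F.mse_loss(pred_slots_next, slots_next)

# Usage sketch over a short video; random tensors stand in for Slot Attention outputs.
batch, num_slots, slot_dim, hidden_dim = 2, 5, 64, 128
slots_per_frame = [torch.randn(batch * num_slots, slot_dim) for _ in range(4)]

prior = AutoregressiveSlotPrior(slot_dim, hidden_dim)
hidden = torch.zeros(batch * num_slots, hidden_dim)
loss = torch.tensor(0.0)
for t in range(len(slots_per_frame) - 1):
    pred_next, hidden = prior(slots_per_frame[t], hidden)
    loss = loss + consistency_loss(pred_next, slots_per_frame[t + 1])
print(loss)
```

In this sketch the consistency term would be added to the base object-centric training objective; how CA-SA weights or schedules it is not stated in the abstract.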