Video Representation
Video representation research seeks efficient and effective ways to encode and process video data for a wide range of applications. Current work develops novel architectures, including implicit neural representations (INRs), transformers, and hybrid models that combine convolutional neural networks (CNNs) with transformers, often incorporating self-supervised learning and multimodal signals such as audio and text. These advances improve video compression, strengthen downstream tasks like action recognition and video retrieval, and enable new capabilities such as video editing and generation. The resulting gains in video understanding and manipulation have implications for fields ranging from surveillance and monitoring to entertainment and healthcare.
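A common thread across the papers listed below is contrastive self-supervised learning: two clips drawn or augmented from the same video are treated as a positive pair and pulled together in embedding space, while clips from other videos are pushed apart. As a minimal sketch of that idea, here is an illustrative InfoNCE-style loss in PyTorch; this is not the method of any specific paper below, and the function name and toy tensors are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE contrastive loss between two batches of clip embeddings.

    z1, z2: (batch, dim) embeddings of two clips from the same set of
    videos; row i of z1 and row i of z2 form a positive pair, and all
    other rows serve as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrized cross-entropy: each clip must identify its counterpart
    # among all clips in the batch, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for a video encoder's outputs
batch, dim = 8, 128
z1, z2 = torch.randn(batch, dim), torch.randn(batch, dim)
loss = info_nce_loss(z1, z2)
```

Symmetrizing over both directions of the similarity matrix is a standard choice, and the temperature controls how sharply hard negatives are weighted; the papers below vary what counts as a positive pair (e.g., across modalities or across time shifts) rather than this basic objective.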
Papers
Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning
Srijan Das, Michael S. Ryoo
Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning
Manlin Zhang, Jinpeng Wang, Andy J. Ma
Time-Equivariant Contrastive Video Representation Learning
Simon Jenni, Hailin Jin