Self-Supervised Video Representation
Self-supervised video representation learning aims to learn meaningful video features without manual annotations, addressing the high cost of video labeling. Current research focuses on designing novel pretext tasks and augmentations, typically built on masked autoencoders, contrastive learning, and transformer architectures, to capture both static appearance and temporal dynamics. These advances improve the efficiency and performance of downstream tasks such as action recognition, video retrieval, and video summarization, with applications across computer vision and robotics. The resulting representations are increasingly robust and generalize well across diverse datasets and tasks.
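To make the contrastive pretraining idea concrete, here is a minimal sketch of a SimCLR-style InfoNCE objective over two augmented clips of the same videos. The `encoder` and the clip tensors are assumptions, not any specific paper's model; many of the works listed below use variants of this loss (often computed symmetrically in both directions).

```python
# Minimal sketch of clip-level contrastive pretraining with an InfoNCE loss.
# `encoder` is an assumed video backbone mapping a clip tensor to an embedding.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE loss for paired clip embeddings.

    z1, z2: (batch, dim) embeddings of two augmented views of the same videos;
    matching rows are positive pairs, all other rows serve as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                     # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage sketch: clip_a and clip_b are two differently augmented views of the
# same videos, e.g. distinct temporal crops with independent spatial augmentations.
# loss = info_nce_loss(encoder(clip_a), encoder(clip_b))
```

Temporal augmentations (sampling clips from different moments of a video) are what push the representation to encode motion rather than static cues alone, a theme several of the papers below address directly.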
Papers
Hierarchical Self-supervised Representation Learning for Movie Understanding
Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo
Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency
Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, Nong Sang
Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation
Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi
Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning
Srijan Das, Michael S. Ryoo
ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints
Srijan Das, Michael S. Ryoo
Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning
Manlin Zhang, Jinpeng Wang, Andy J. Ma
TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning
Yang Liu, Keze Wang, Lingbo Liu, Haoyuan Lan, Liang Lin