Self-Supervised Video Representation
Self-supervised video representation learning aims to learn meaningful video features without manual annotations, addressing the high cost of video labeling. Current research focuses on novel pretext tasks and augmentations, often built on masked autoencoders, contrastive learning, and transformer architectures, to capture both the static and dynamic aspects of video. These advances improve the efficiency and performance of downstream tasks such as action recognition, video retrieval, and video summarization, with applications across computer vision and robotics. The resulting representations are increasingly robust and generalize well across diverse datasets and tasks.
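To make the contrastive-learning idea mentioned above concrete, the following is a minimal sketch of an InfoNCE-style objective: embeddings of two augmented clips from the same video (the positive pair) are pulled together, while clips from other videos in the batch serve as negatives. The batch size, embedding dimension, and temperature here are illustrative assumptions, not values from any specific paper.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE loss for paired clip embeddings.

    z1, z2: (batch, dim) embeddings of two augmented views; row i of z1
    and row i of z2 come from the same video (the positive pair).
    """
    # Normalize so the dot product becomes cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # Softmax cross-entropy over rows; positives lie on the diagonal.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z1 = rng.standard_normal((8, 128))
z2 = z1 + 0.05 * rng.standard_normal((8, 128))   # slightly perturbed "views"
loss_matched = info_nce_loss(z1, z2)
loss_random = info_nce_loss(z1, rng.standard_normal((8, 128)))
print(loss_matched, loss_random)  # matched views yield the lower loss
```

In practice the embeddings would come from a video encoder (e.g. a transformer backbone) applied to two differently augmented clips, and the loss would be minimized with gradient descent; this sketch only illustrates the objective itself.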
Papers
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak Shah
SELF-VS: Self-supervised Encoding Learning For Video Summarization
Hojjat Mokhtarabadi, Kave Bahraman, Mehrdad HosseinZadeh, Mahdi Eftekhari