Self-Supervised Audio-Visual Learning
Self-supervised audio-visual learning aims to learn robust representations of audio and visual data by training models on unlabeled data, avoiding the large, manually annotated datasets that supervised methods require. Current research centers on contrastive learning and masked-autoencoder architectures, often incorporating techniques such as equivariance to data augmentations and hierarchical structures for learning multi-level features. These advances are improving performance on tasks such as emotion recognition, speech recognition, and video inpainting, pointing toward more efficient and generalizable audio-visual systems.
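To make the contrastive objective concrete, the sketch below shows a symmetric InfoNCE loss over paired audio and video clip embeddings, a common formulation in audio-visual contrastive learning. This is a minimal illustration, not a specific paper's method: the function name, embedding dimension, and temperature value are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, video_emb: (B, D) tensors; row i of each comes from the
    same clip, so (i, i) pairs are positives and all other pairs in the
    batch serve as negatives.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature  # (B, B) cosine-similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Contrast in both directions: audio -> video and video -> audio.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)

# Example with random embeddings standing in for encoder outputs.
audio = torch.randn(8, 256)
video = torch.randn(8, 256)
print(av_contrastive_loss(audio, video).item())
```

In practice the two embeddings would come from separate audio and video encoders trained jointly, so that minimizing this loss pulls representations of the same clip together across modalities.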