Self-Supervised Audio-Visual Learning

Self-supervised audio-visual learning aims to learn robust audio and visual representations from unlabeled data, avoiding the large, manually annotated datasets that supervised methods require. Current research centers on contrastive learning and masked-autoencoder architectures, often incorporating equivariance to data augmentations and hierarchical structures that capture multi-level features. These advances are improving performance on tasks such as emotion recognition, speech recognition, and video inpainting, pointing toward more efficient and generalizable audio-visual systems.
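The contrastive objective mentioned above is typically an InfoNCE-style loss: for each audio clip, its paired video frame is treated as the positive and all other frames in the batch as negatives. A minimal sketch (the function names, toy embeddings, and temperature value are illustrative assumptions, not a specific paper's implementation):

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def infonce_loss(audio_embs, visual_embs, temperature=0.1):
    """Average InfoNCE loss over a batch of paired audio/visual embeddings.

    audio_embs[i] should match visual_embs[i]; every other visual
    embedding in the batch serves as a negative.
    """
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio_embs[i], visual_embs[j]) / temperature
                  for j in range(n)]
        # numerically stable log-sum-exp over the batch
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)  # -log softmax at the true pair
    return total / n

# Toy check: correctly aligned pairs should score a lower loss
# than deliberately mismatched ones.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = infonce_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped = infonce_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each audio embedding toward its own frame and pushes it away from the rest of the batch, which is the mechanism behind the cross-modal representations discussed above.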

Papers