Self-Supervised Speech Representation Learning

Self-supervised speech representation learning aims to extract meaningful features from raw audio without relying on labeled data, enabling the training of robust speech models for diverse tasks. Current research focuses on improving model efficiency (e.g., through knowledge distillation and pruning), enhancing robustness to noise and reverberation, and exploring different training objectives (e.g., contrastive learning, regression, and multi-task learning), often within architectures such as HuBERT and wav2vec. These advances matter because they allow high-performing speech models to be trained on readily available unlabeled audio, reducing reliance on expensive, time-consuming annotation and expanding the range of applications possible in low-resource settings.
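To make the contrastive objective concrete, below is a minimal NumPy sketch of an InfoNCE-style loss of the kind used in wav2vec 2.0-style pre-training: a context representation at a masked timestep must identify the true latent among sampled distractors. All function names, the temperature value, and the toy dimensions are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between a (M, D) and b (N, D) -> (M, N)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def info_nce_loss(context, positive, negatives, temperature=0.1):
    """Contrastive (InfoNCE) loss for one masked timestep.

    context:   (D,)   context-network output at the masked position
    positive:  (D,)   true target latent for that position
    negatives: (K, D) distractor latents sampled from other positions
    Returns the negative log-probability of picking the positive.
    """
    candidates = np.vstack([positive[None, :], negatives])        # (K+1, D)
    sims = cosine_sim(context[None, :], candidates)[0] / temperature
    logits = sims - sims.max()                                    # stabilize softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                                          # positive is index 0

# Toy check: a context aligned with its positive scores a lower loss
# than one pointing the opposite way.
rng = np.random.default_rng(0)
pos = rng.normal(size=16)
negs = rng.normal(size=(5, 16))
aligned_loss = info_nce_loss(pos, pos, negs)
opposed_loss = info_nce_loss(-pos, pos, negs)
```

In full systems the similarity scoring and distractor sampling run over whole batches of masked timesteps, but the per-position objective reduces to this cross-entropy over one positive and K negatives.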

Papers