Self-Supervised Speech Representation Learning
Self-supervised speech representation learning aims to extract meaningful features from raw audio without relying on labeled data, enabling the training of robust speech models for diverse tasks. Current research focuses on improving model efficiency (e.g., through knowledge distillation and pruning), enhancing robustness to noise and reverberation, and exploring different training objectives (e.g., contrastive learning, regression, and multi-task learning), often within architectures such as HuBERT and wav2vec. These advances matter because they enable high-performing speech models to be trained on readily available unlabeled audio, reducing reliance on expensive, time-consuming annotation and opening up applications in low-resource settings.
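To make the contrastive objective mentioned above concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss in the spirit of wav2vec 2.0: the model's context output at each masked time step must identify the true target latent among a set of distractors sampled from other time steps. The function name `contrastive_loss`, the tensor shapes, and the `temperature` value are illustrative assumptions, not the exact implementation from any of the listed papers.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(context, targets, distractors, temperature=0.1):
    """InfoNCE-style contrastive objective (illustrative sketch).

    context:     (B, T, D) transformer outputs at masked positions
    targets:     (B, T, D) true target latents for those positions
    distractors: (B, T, K, D) K negatives drawn from other time steps
    """
    # Stack the positive target with the K distractors: (B, T, K+1, D)
    candidates = torch.cat([targets.unsqueeze(2), distractors], dim=2)
    # Cosine similarity between each context vector and every candidate,
    # scaled by a temperature, gives the logits: (B, T, K+1)
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1)
    logits = logits / temperature
    # The positive candidate sits at index 0, so every label is 0
    labels = torch.zeros(logits.shape[:2], dtype=torch.long,
                         device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())


# Toy usage with random tensors (batch=2, steps=5, dim=16, 4 negatives)
B, T, D, K = 2, 5, 16, 4
loss = contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, D),
                        torch.randn(B, T, K, D))
print(loss.item())
```

Regression-style objectives (as in HuBERT-like distillation setups) would instead minimize a distance between student and teacher features; the contrastive form shown here is just one of the training objectives the literature explores.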
Papers
Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding
Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, Shinji Watanabe
A low latency attention module for streaming self-supervised speech representation learning
Jianbo Ma, Siqi Pan, Deepak Chandran, Andrea Fanelli, Richard Cartwright