Self-Supervised Speech Representation

Self-supervised speech representation learning derives speech embeddings from large amounts of unlabeled audio, improving downstream tasks such as speech recognition and speech enhancement while reducing the need for transcribed data. Current research focuses on refining model architectures such as wav2vec 2.0, HuBERT, and XLSR; investigating the properties of the learned representations (e.g., the orthogonality of speaker and phonetic information); and addressing performance biases across language varieties. The field matters because it enables advances in speech technology for low-resource languages and diverse speaker populations, and because it offers insight into the fundamental nature of speech representation itself.
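
To make the orthogonality property mentioned above concrete, here is a minimal sketch of one way it could be probed. It is an illustrative assumption, not the method of any particular paper: per-speaker and per-phone centroid offsets are compared by cosine similarity, and the embeddings are hand-built toy vectors rather than outputs of a real model. All function names (`centroid_offsets`, `mean_abs_cosine`) are hypothetical.

```python
import math
from collections import defaultdict


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid_offsets(embeddings, labels):
    """Map each label to (its centroid minus the global mean)."""
    global_mean = mean(embeddings)
    groups = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        groups[label].append(emb)
    return {
        label: [c - g for c, g in zip(mean(vecs), global_mean)]
        for label, vecs in groups.items()
    }


def mean_abs_cosine(offsets_a, offsets_b):
    """Average |cosine| over all pairs; near 0 suggests orthogonal directions."""
    values = [
        abs(cosine(u, v))
        for u in offsets_a.values()
        for v in offsets_b.values()
    ]
    return sum(values) / len(values)


# Toy data (assumption): speaker identity varies only in dims 0-1,
# phone identity only in dims 2-3, so the two are orthogonal by construction.
embeddings = [
    [1.0, 0.0, 1.0, 0.0],  # speaker s1, phone a
    [1.0, 0.0, 0.0, 1.0],  # speaker s1, phone b
    [0.0, 1.0, 1.0, 0.0],  # speaker s2, phone a
    [0.0, 1.0, 0.0, 1.0],  # speaker s2, phone b
]
speakers = ["s1", "s1", "s2", "s2"]
phones = ["a", "b", "a", "b"]

orthogonal_score = mean_abs_cosine(
    centroid_offsets(embeddings, speakers),
    centroid_offsets(embeddings, phones),
)

# Contrasting toy data (assumption): speaker and phone share one dimension.
entangled = [[2.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
entangled_score = mean_abs_cosine(
    centroid_offsets(entangled, speakers),
    centroid_offsets(entangled, phones),
)
```

With real model outputs, the embeddings would be frame- or utterance-level vectors, and a score close to zero would only be suggestive; subspace-level analyses are generally more informative than pairwise centroid comparisons.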

Papers