Self Supervised Speech Representation

Self-supervised speech representation learning aims to learn general-purpose speech embeddings from large amounts of unlabeled audio, improving downstream tasks such as speech recognition and speech enhancement while reducing the need for transcribed data. Current research focuses on refining models such as wav2vec 2.0, HuBERT, and XLSR, investigating the properties of the learned representations (e.g., whether speaker and phonetic information occupy orthogonal subspaces), and addressing performance biases across language varieties. The field matters because it supports speech technology for low-resource languages and diverse speaker populations, and because it offers insight into how speech information is encoded.

Papers

August 2, 2023

SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis
Ramanan Sivaguru, Vasista Sai Lodagala, S Umesh
Synthesized Speech Speech Quality Self Supervised Speech Representation Text to Speech Synthesis Fastspeech2 Architecture

July 30, 2023

Mispronunciation detection using self-supervised speech representations
Jazmin Vidal, Pablo Riera, Luciana Ferrer
Self Supervised Learning Self Supervised Speech Representation Mispronunciation Detection

July 27, 2023

The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions
George Close, Thomas Hain, Stefan Goetze
Loss Function Mixed Effect Speech Enhancement Self Supervised Speech Representation Spoken Language

July 25, 2023

Non Intrusive Intelligibility Predictor for Hearing Impaired Individuals using Self Supervised Speech Representations
George Close, Thomas Hain, Stefan Goetze
Speech Enhancement Speech Quality Self Supervised Speech Representation Intellectual Disability

July 11, 2023

On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis
Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
Text to Speech Speech Synthesis Greater Public Use Speech Representation Synthesized Speech Self Supervised Speech Representation Mean Opinion Score Spontaneous Speech Synthesis

June 2, 2023

DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model
Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Jinfeng Bai
Lightweight High Self Supervised Speech Representation Cross Lingual Model Cross Lingual Performance Cross Lingual Representation Cross Lingual Speech Representation

June 1, 2023

May 31, 2023

Intelligible Lip-to-Speech Synthesis with Speech Units
Jeongsoo Choi, Minsu Kim, Yong Man Ro
Mel Spectrogram Self Supervised Speech Representation Lip Movement Lip to Speech Synthesis

May 21, 2023

Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces
Oli Liu, Hao Tang, Sharon Goldwater
Predictive Coding Self Supervised Speech Representation Phonetic Information Speaker Normalization

May 19, 2023

MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting
Neil Shah, Vishal Tambrahalli, Saiteja Kosgi, Niranjan Pedanekar, Vineet Gandhi
Text to Speech Speech Synthesis Low Resource Self Supervised Speech Representation High Quality Speech Multi Speaker Text to Speech

March 15, 2023

Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences
Yuan Tseng, Cheng-I Lai, Hung-yi Lee
Automatic Speech Recognition Self Supervised Speech Representation Cascade Model Unsupervised Parsing

March 14, 2023

Lightweight feature encoder for wake-up word detection based on self-supervised speech representation
Hyungjun Lim, Younggwan Kim, Kiho Yeom, Eunjoo Seo, Hoodong Lee, Stanley Jungkyu Choi, Honglak Lee
Self Supervised Audio Representation Self Supervised Speech Representation Lightweight Encoder Wake Word Downstream Speech

March 5, 2023

A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
Comparative Study Speech Representation Read V Self Supervised Speech Representation TTS Model

March 1, 2023

ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations
Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha Sahipjohn, Niranjan Pedanekar, Vineet Gandhi
Indian Language Self Supervised Speech Representation Multilingual Scenario Text to Speech Synthesis Multilingual TTS

February 24, 2023

Phone and speaker spatial organization in self-supervised speech representations
Pablo Riera, Manuela Cerdeiro, Leonardo Pepino, Luciana Ferrer
Self Supervised Speech Representation Representational Similarity Speech Driven Speech Segment Spatial Structure

February 16, 2023

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations
Shehzeen Hussain, Paarth Neekhara, Jocelyn Huang, Jason Li, Boris Ginsburg
Adaptive Importance Speech Representation Self Supervised Speech Representation Speaker Representation Zero Shot Voice Conversion Ace Opencpop

February 11, 2023

Improved Decoding of Attentional Selection in Multi-Talker Environments with Self-Supervised Learned Speech Representation
Cong Han, Vishal Choudhari, Yinghao Aaron Li, Nima Mesgarani
Speech Representation Self Supervised Speech Representation Auditory Attention Speech Envelope Multi Talker Environment

January 11, 2023

Perceive and predict: self-supervised speech representation based loss functions for speech enhancement
George Close, William Ravenscroft, Thomas Hain, Stefan Goetze
Loss Function Speech Enhancement Speech Representation Speech Intelligibility Self Supervised Speech Representation Neural Speech Enhancement

December 7, 2022

Improved Self-Supervised Multilingual Speech Representation Learning Combined with Auxiliary Language Information
Fenglin Ding, Genshun Wan, Pengcheng Li, Jia Pan, Cong Liu
Self Supervised Multilingual Automatic Speech Recognition Self Supervised Speech Representation Pre Trained Multilingual Model

Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?
