Speech Representation

Speech representation research develops numerical encodings of spoken language that capture both linguistic content and speaker-specific characteristics for downstream tasks such as speech recognition and voice conversion. Current work relies heavily on transformer-based architectures and self-supervised learning, using objectives such as masked prediction and contrastive learning to learn robust representations from large unlabeled datasets. These advances are improving efficiency and accuracy in applications including automatic speech recognition, speaker identification, and speech synthesis, and analyses of the learned representations offer insight into how these models work internally. A further line of work aims to disentangle content from speaker information within the representations, which should make models more robust and versatile.
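
To make the masked-prediction and contrastive-learning idea concrete, here is a minimal sketch in plain Python. It is not the method of any paper listed on this page. It assumes a simplified setup in which some frames are masked, a context vector is produced for each masked frame, and the loss rewards the context vector for matching its own target frame rather than the targets at other masked positions. The names `mask_frames` and `info_nce`, the cosine similarity, and the `temperature` value are all illustrative assumptions.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def mask_frames(num_frames, mask_prob, rng):
    """Pick the indices of frames to hide (hypothetical masking policy:
    each frame is masked independently with probability mask_prob)."""
    return [t for t in range(num_frames) if rng.random() < mask_prob]


def info_nce(context, targets, masked, temperature=0.1):
    """Average contrastive (InfoNCE-style) loss over masked positions.

    For a masked frame t, the positive is targets[t]; the negatives are
    the targets at the other masked positions (an assumption of this sketch).
    """
    if len(masked) < 2:
        return 0.0  # no negatives available, so nothing to contrast
    total = 0.0
    for t in masked:
        logits = [cosine(context[t], targets[k]) / temperature for k in masked]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        positive = cosine(context[t], targets[t]) / temperature
        total += log_denominator - positive
    return total / len(masked)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "frames": random 8-dimensional vectors standing in for speech features.
    targets = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(20)]
    masked = mask_frames(len(targets), mask_prob=0.3, rng=rng)
    # Using the targets themselves as context vectors mimics a perfect predictor.
    print("masked frames:", masked)
    print("loss with perfect predictions:", info_nce(targets, targets, masked))
```

In a real system the context vectors would come from a trained encoder that never sees the masked frames; here they are passed in directly so that the loss computation can be read in isolation.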

Papers

May 25, 2023

DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion
Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee
Generative Model Speech Representation Disentangled Representation Denoising Diffusion Model Spatio Temporal Mixup Mechanism Voice Style Transfer Robust Generative

May 18, 2023

Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering
Heng-Jui Chang, Alexander H. Liu, James Glass
Speech Recognition Speech Representation Semantic Representation Acoustic Unit Self Supervised Speech Representation Model

May 2, 2023

Contrastive Speech Mixup for Low-resource Keyword Spotting
Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Eng Siong Chng, Bin Ma
Contrastive Loss Speech Representation Keyword Spotting Google Speech Command Mixup Augmentation

April 27, 2023

Understanding Shared Speech-Text Representations
Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang
Speech Representation Source Free Domain Adaptation Speech Model Joint Speech Text

April 24, 2023

Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model
Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima
Zero Shot Speech Representation Critical Synthesis Zero Shot Text to Speech Voice Style Transfer Self Supervised Speech Representation Model

April 8, 2023

Unsupervised Speech Representation Pooling Using Vector Quantization
Jeongkyun Park, Kwanghee Choi, Hyunjun Heo, Hyung-Min Park
Speech Representation Audio Representation Vector Quantization Speaker Identification Attention Based Pooling Unsupervised Speech Representation

March 19, 2023

Textless Speech-to-Music Retrieval Using Emotion Similarity
SeungHeon Doh, Minz Won, Keunwoo Choi, Juhan Nam
Text Modality App to App Retrieval Speech Representation Audio Text Retrieval Emotion Space

March 14, 2023

Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference
Biao Fu, Minpeng Liao, Kai Fan, Zhongqiang Huang, Boxing Chen, Yidong Chen, Xiaodong Shi
Scientific Inference Speech Representation Speech Translation Predictive Inference Offline Speech Translation

March 13, 2023

Analysing the Masked predictive coding training criterion for pre-training a Speech Representation Model
Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
Pre Training Speech Representation Speaker Information Learned Model

March 5, 2023

A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
Comparative Study Speech Representation Read V Self Supervised Speech Representation Tt Model

March 3, 2023

Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations
Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani
Speech Representation Text Representation Speech Generation Speech Segment Speech Restoration Speech Distortion Unsupervised Speech Enhancement

February 28, 2023

deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition
Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, Jinjie Ni, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma
Self Supervised Speech Recognition Industrial Disturbing Noise Speech Representation Speech Recognition System Noisy Embeddings

February 27, 2023

A low latency attention module for streaming self-supervised speech representation learning
Jianbo Ma, Siqi Pan, Deepak Chandran, Andrea Fanelli, Richard Cartwright
Automatic Speech Recognition Transformer Architecture Speech Representation Attention Module Speech Processing Self Supervised Speech Representation Learning

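February 16, 2023

Speech Enhancement with Multi-granularity Vector Quantization

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations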

February 11, 2023

Improved Decoding of Attentional Selection in Multi-Talker Environments with Self-Supervised Learned Speech Representation
Cong Han, Vishal Choudhari, Yinghao Aaron Li, Nima Mesgarani
Speech Representation Self Supervised Speech Representation Auditory Attention Speech Envelope Multi Talker Environment

January 11, 2023

Perceive and predict: self-supervised speech representation based loss functions for speech enhancement
George Close, William Ravenscroft, Thomas Hain, Stefan Goetze
Loss Function Speech Enhancement Speech Representation Speech Intelligibility Self Supervised Speech Representation Neural Speech Enhancement

December 29, 2022

StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models
Yinghao Aaron Li, Cong Han, Nima Mesgarani
Knowledge Transfer Voice Conversion Speech Representation One Shot Voice Conversion

December 14, 2022

Efficient Speech Representation Learning with Low-Bit Quantization
Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Abdelrahman Mohamed
Model Compression Speech Representation Quantization Technique Low Bit Quantization

December 7, 2022

Progressive Multi-Scale Self-Supervised Learning for Speech Recognition
Genshun Wan, Tan Liu, Hang Chen, Jia Pan, Cong Liu, Zhongfu Ye
Automatic Speech Recognition Self Supervised Speech Recognition Multi Scale Speech Representation Automatic Speech Recognition Performance
