Prosody Encoder

Prosody encoders are neural network components that extract and represent the prosody of speech: its melodic and rhythmic properties, such as pitch, loudness, and timing. These representations matter both for natural, expressive speech synthesis and for speech understanding. Current research focuses on disentangling prosody from other speech attributes such as speaker identity and semantic content, often with unsupervised learning, and on integrating prosody information into end-to-end models for tasks such as text-to-speech, voice conversion, and dialogue act classification. The aim of this work is more natural and emotionally nuanced speech synthesis, more accurate spoken language understanding, and more human-like conversational agents.
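To make the idea of a prosody representation concrete, here is a minimal sketch in plain Python. It is not taken from any of the papers listed on this page: the encoders in those papers are learned neural networks, whereas this stand-in uses hand-crafted features (a crude autocorrelation pitch estimate and log energy per frame) and pools them into a fixed-length utterance-level vector. All function names, the frame sizes, and the 80-400 Hz pitch search range are assumptions chosen for illustration.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of equal length."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """Log of the mean squared amplitude (small floor avoids log(0))."""
    return math.log(sum(x * x for x in frame) / len(frame) + 1e-10)

def estimate_f0(frame, sample_rate, f0_min=80.0, f0_max=400.0):
    """Crude pitch estimate: lag of the autocorrelation peak in [f0_min, f0_max].

    Illustrative only; practical systems use more robust pitch trackers.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def mean_std(values):
    """Mean and population standard deviation of a non-empty list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def prosody_embedding(samples, sample_rate, frame_len=400, hop=200):
    """Pool frame-level pitch and energy into [f0_mean, f0_std, en_mean, en_std].

    The pooling step mimics how an utterance-level encoder collapses a
    variable-length input into a fixed-size vector.
    """
    frames = frame_signal(samples, frame_len, hop)
    f0 = [estimate_f0(f, sample_rate) for f in frames]
    energy = [log_energy(f) for f in frames]
    return [*mean_std(f0), *mean_std(energy)]

# Usage: a synthetic 200 Hz tone sampled at 8 kHz (0.2 seconds).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
embedding = prosody_embedding(tone, sample_rate)
print(embedding)
```

For the steady synthetic tone, the pitch mean should come out at about 200 Hz with essentially zero variation, which is what a flat, monotone contour looks like in this representation. A learned prosody encoder replaces the hand-crafted features and the mean/std pooling with trainable layers, and the disentanglement work described above adds training objectives so that the resulting vector carries as little speaker and content information as possible.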

Papers

December 2, 2024

FreeCodec: A disentangled neural speech codec with fewer tokens
Youqiang Zheng, Weiping Tu, Yueteng Kang, Jie Chen, Yike Zhang, Li Xiao, Yuhong Yang, Long Ma
Disentangled Representation Neural Speech Low Bitrate Prosody Encoder

February 22, 2024

Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition
Rendi Chevi, Alham Fikri Aji
Text to Speech Experienced Emotion Prosodic Feature Microbial Decomposition Speech Naturalness Prosody Encoder

September 25, 2023

Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping
Minki Kang, Wooseok Han, Eunho Yang
Zero Shot Face Image Zero Shot Text to Speech Face Voice Prosody Encoder

May 20, 2023

ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios
Yuyue Wang, Huan Xiao, Yihan Wu, Ruihua Song
Text Modality Speech Analysis Human Humor Low Resource Scenario Prosody Encoder

December 14, 2022

Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, Fuji Ren, Stefan Wermter
Prosodic Feature Speech Reconstruction Prosody Encoder

November 9, 2022

Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
Ziqian Ning, Qicong Xie, Pengcheng Zhu, Zhichao Wang, Liumeng Xue, Jixun Yao, Lei Xie, Mengxiao Bi
Voice Conversion Expressive Speech Major Challenge Bottleneck Feature Perturbation Audio Encoder Attention Fusion Prosody Encoder

July 4, 2022

Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis
Tao Li, Xinsheng Wang, Qicong Xie, Zhichao Wang, Mingqi Jiang, Lei Xie
End to End Prosody Encoder Emotion Transition

May 11, 2022

A neural prosody encoder for end-to-end dialogue act classification
Kai Wei, Dillon Knox, Martin Radfar, Thanh Tran, Markus Muller, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris, Maurizio Omologo
Dialogue System Spoken Language Understanding Prosodic Feature Gating Mechanism Prosody Encoder Act Classification

February 16, 2022

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech
Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao
Text to Speech Prosodic Feature Vector Quantization Expressive Text to Speech Prosody Encoder

November 19, 2021

Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis
End to End Prosodic Feature Prosody Modeling Prosody Encoder Prosody Control
