Prosodic Feature

Prosodic features, such as pitch, intensity, and rhythm, are crucial for conveying meaning and emotion beyond the literal words spoken. Current research focuses on accurately modeling and manipulating these features in applications such as speech synthesis, speech editing, and voice conversion, often using deep learning approaches such as diffusion models, variational autoencoders, and actor-critic reinforcement learning. This work matters for improving the naturalness and expressiveness of synthetic speech, enhancing accessibility for individuals with communication disorders, and advancing our understanding of human communication.

Papers

December 16, 2023

CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis
Yayue Deng, Jinlong Xue, Yukang Jia, Qifei Li, Yichen Han, Fengping Wang, Yingming Gao, Dengfeng Ke, Ya Li
Contrastive Learning Synthesized Speech Prosodic Feature Contextual Understanding Conversational Speech Synthesis

November 28, 2023

Quantifying the redundancy between prosody and text
Lukas Wolf, Tiago Pimentel, Evelina Fedorenko, Ryan Cotterell, Alex Warstadt, Ethan Wilcox, Tamar Regev
Text Modality Prosodic Feature Information Redundancy Linguistic Information

November 13, 2023

SponTTS: modeling and transferring spontaneous style for TTS
Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie
Text to Speech Prosodic Feature Style Representation Speech Generation Spontaneous Speech

October 23, 2023

DPP-TTS: Diversifying prosodic features of speech via determinantal point processes
Seongho Joo, Hyukhun Koh, Kyomin Jung
Speech Analysis Prosodic Feature Determinantal Point Process Speech Segment Current TTS System Neural Text to Speech

October 21, 2023

Automatic Pronunciation Assessment -- A Review
Yassine El Kheir, Ahmed Ali, Shammur Absar Chowdhury
Deep Learning Natural Language Processing Narrative Review Prosodic Feature Automatic Pronunciation Assessment Pronunciation Assessment Pronunciation Training

October 14, 2023

Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling
Tiberiu Boros, Stefan Daniel Dumitrescu, Ionut Mironica, Radu Chivereanu
End to End Voice Conversion Generative Adversarial Prosodic Feature High Fidelity Vocoder Text to Speech Synthesis Speech Input

October 10, 2023

Prosody Analysis of Audiobooks
Charuta Pethe, Bach Pham, Felix D Childress, Yunting Yin, Steven Skiena
Prosodic Feature Prosody Prediction

September 25, 2023

HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS
Dake Guo, Xinfa Zhu, Liumeng Xue, Tao Li, Yuanjun Lv, Yuepeng Jiang, Lei Xie
Graph Neural Network Text to Speech Prosodic Feature Synthetic Voice Demographic Parity

September 21, 2023

September 11, 2023

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu
Prosodic Feature Contrastive Pretraining Prosody Modeling Controllable Text to Speech Silent Speech

September 9, 2023

Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations
Debaditya Shome, Ali Etemad
Speech Emotion Recognition Prosodic Feature Target Emotion Cross Modal Knowledge Distillation

September 6, 2023

Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature
Kyungguen Byun, Sunkuk Moon, Erik Visser
Voice Conversion Prosodic Feature

July 31, 2023

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba
Diffusion Model Normalizing Flow Prosodic Feature Mel Spectrogram Acoustic Modeling

July 30, 2023

Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation
Yuanhao Chen
Chinese Character Prosodic Feature Natural Sounding Speech Word Segmentation Addressing Tone Sandhi

July 5, 2023

Going Retro: Astonishingly Simple Yet Effective Rule-based Prosody Modelling for Speech Synthesis Simulating Emotion Dimensions
Felix Burkhardt, Uwe Reichel, Florian Eyben, Björn Schuller
Speech Synthesis Prosodic Feature Emotion Annotation Speech Synthesizer

June 20, 2023

eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer
Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman
Long Context Prosodic Feature End to End Model Multi Speaker TTS Fine Grained Prosody

June 13, 2023

PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling
Ji-Sang Hwang, Sang-Hoon Lee, Seong-Whan Lee
Speech Synthesis Synthesized Speech Prosodic Feature Speech Encoder Prosody Modeling Speech Pause

June 10, 2023

What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model
Mu Yang, Ram C. M. C. Shekar, Okim Kang, John H. L. Hansen
Prosodic Feature Accent Recognition Phoneme Alignment

A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis

FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency

Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech
