Conversational Speech Synthesis

Conversational speech synthesis (CSS) aims to generate realistic, expressive speech that fits the context of a dialogue, with a focus on natural prosody, emotion, and turn-taking. Current research concentrates on better context modeling, using techniques such as heterogeneous graphs and contrastive learning, and often incorporates large language models to strengthen both semantic understanding and stylistic control. Progress also depends on larger, more diverse datasets, including recordings of natural conversational styles and data with emotional annotations, which help make synthesized speech more natural and expressive. These advances ultimately benefit applications such as conversational AI and accessibility technologies.

Papers

January 11, 2025

Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis
Rui Liu, Zhenqi Jia, Feilong Bao, Haizhou Li
Dialogue System Expressive Speech Conversational Context Conversational Speech Synthesis

January 9, 2025

JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis
Jun-Hyeok Cha, Seung-Bin Kim, Hyung-Seok Oh, Seong-Whan Lee
Medical LLM Emotional Speech Conversational Speech Synthesis

December 25, 2024

Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis
Zhenqi Jia, Rui Liu
Text Modality Speech Synthesis Conversational Context Dialogue History Modality Interaction Conversational Speech Synthesis

July 31, 2024

Generative Expressive Conversational Speech Synthesis
Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
Conversational Context Dialogue History Expressive Speech Synthesis Natural Conversation Conversational Speech Synthesis

June 6, 2024

Synthesizing Conversations from Unlabeled Documents using Automatic Response Segmentation
Fanyou Wu, Weijie Xu, Chandan K. Reddy, Srinivasan H. Sengamedu
Dialogue System Dialogue Scenario ConvQA Face Challenge ConvQA Model Conversational Speech Synthesis

December 19, 2023

Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling
Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
Heterogeneous Graph Underlying Emotion Emotion Annotation Emotion Model Conversational Speech Synthesis

December 16, 2023

CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis
Yayue Deng, Jinlong Xue, Yukang Jia, Qifei Li, Yichen Han, Fengping Wang, Yingming Gao, Dengfeng Ke, Ya Li
Contrastive Learning Synthesized Speech Prosodic Feature Contextual Understanding Conversational Speech Synthesis

August 31, 2023

Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
Weiqin Li, Shun Lei, Qiaochu Huang, Yixuan Zhou, Zhiyong Wu, Shiyin Kang, Helen Meng
Spontaneous Speech Semi Supervised Training Spontaneous Motor Conversational Speech Synthesis Behavior Label

May 29, 2023

Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis
Erik Ekstedt, Siyang Wang, Éva Székely, Joakim Gustafson, Gabriel Skantze
Speech to Text Automatic Evaluation Traffic Sign Conversational Speech Synthesis Turn Taking

February 7, 2023

PLACES: Prompting Language Models for Social Conversation Synthesis
Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Seokhwan Kim, Andy Rosenbaum, Yang Liu, Zhou Yu, Dilek Hakkani-Tur
Conversational Data Multi Party Synthetic Dialogue Place Conversational Speech Synthesis
