Controllable Speech Synthesis

Controllable speech synthesis aims to generate speech with precise control over attributes such as content, speaker identity, emotion, and prosody. Current research focuses on models, often built on large language models or generative architectures such as variational autoencoders and flow-based models, that allow fine-grained manipulation through input prompts (text, audio, or discrete labels). The field matters for applications such as personalized text-to-speech, voice cloning for dubbing, and more expressive, natural-sounding conversational AI. Advances in controllable synthesis are also enabling the creation of synthetic datasets that improve other speech technologies, such as automatic speech recognition.
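As a minimal illustration of how the discrete control labels mentioned above are typically turned into model conditioning, the sketch below one-hot encodes hypothetical speaker and emotion labels and concatenates them with a continuous speaking-rate value into a single conditioning vector. All names and label sets here are illustrative assumptions, not drawn from any specific system.

```python
# Hypothetical sketch: building a conditioning vector for a
# controllable TTS model from discrete and continuous controls.

SPEAKERS = ["alice", "bob", "carol"]             # assumed speaker inventory
EMOTIONS = ["neutral", "happy", "sad", "angry"]  # assumed emotion labels


def one_hot(label: str, vocab: list[str]) -> list[float]:
    """Encode a discrete label as a one-hot vector over `vocab`."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(label)] = 1.0
    return vec


def conditioning_vector(speaker: str, emotion: str,
                        speaking_rate: float) -> list[float]:
    """Concatenate one-hot speaker and emotion encodings with a
    continuous rate control into one vector that a synthesis model
    could consume alongside the input text."""
    return one_hot(speaker, SPEAKERS) + one_hot(emotion, EMOTIONS) + [speaking_rate]


cond = conditioning_vector("bob", "happy", 1.2)
print(len(cond))  # 3 speaker dims + 4 emotion dims + 1 rate dim = 8
```

In practice such vectors are usually learned embeddings rather than one-hot codes, and are injected into the acoustic model (for example, added to encoder states or used as cross-attention context), but the interface idea is the same: separate, composable controls mapped into a shared conditioning space.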

Papers