Controllable Text-to-Speech

Controllable text-to-speech (TTS) aims to synthesize speech from text while giving precise control over attributes such as speaker identity, speaking style, and emotional expression, with that control specified through natural language descriptions. Current research focuses on models that achieve this control using techniques such as decoder-only transformers, normalizing flows that model variance in speech features, and multi-modal approaches that combine text and speech information. These advances are improving the naturalness and robustness of synthesized speech, with applications in personalized voice assistants, accessible communication technologies, and more expressive audio content creation.

Papers