Expressive Text to Speech

Expressive Text-to-Speech (TTS) aims to synthesize speech that naturally conveys emotion, style, and other nuanced aspects of human communication. Current research focuses heavily on improving control over these expressive qualities, often using diffusion models and large language models so that a target style can be specified through natural language prompts or transferred from reference audio. This work involves developing robust methods for representing and manipulating prosody, and addressing challenges such as data scarcity and the need to generalize across speakers and styles. Advances in expressive TTS have significant implications for applications ranging from accessibility technologies to more engaging virtual assistants and creative content generation.

Papers