Emotional Text-to-Speech

Emotional text-to-speech (TTS) aims to synthesize speech that conveys an intended emotion rather than a neutral reading. Current research focuses on making emotional expression more controllable, using techniques such as direct preference optimization, contrastive learning with diffusion models, and large language models that provide nuanced emotional guidance. The field matters because expressive speech improves human-computer interaction, with applications in virtual assistants, dubbing, and accessibility technologies, particularly for people with communication impairments. Building more robust and controllable emotional TTS systems depends on solving open challenges such as fine-grained control of emotion intensity and generalization to unseen speakers and languages.

Papers