Emotional Text to Speech
Emotional text-to-speech (TTS) aims to synthesize speech that accurately conveys intended emotions, moving beyond neutral speech generation. Current research focuses on improving controllability of emotional expression through techniques like direct preference optimization, contrastive learning with diffusion models, and leveraging large language models for nuanced emotional guidance. This field is significant because it enhances human-computer interaction and has applications in areas such as virtual assistants, dubbing, and accessibility technologies, particularly for individuals with communication impairments. The development of more robust and controllable emotional TTS systems relies on addressing challenges such as fine-grained emotion intensity control and generalization to unseen speakers and languages.
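To make the notion of fine-grained emotion intensity control concrete, the sketch below shows one way an emotion could be encoded as a spherical control vector, loosely inspired by the spherical-emotion-vector idea in the EmoSphere-TTS title: the radius encodes intensity and the angles encode emotional style. This is an illustrative assumption, not the papers' actual method; the function name and parameterization are invented for this example.

```python
import numpy as np

def spherical_emotion_vector(intensity, azimuth, elevation):
    """Map intensity (radius) and style angles (radians) to a 3-D
    Cartesian control vector. Illustrative only: a real system would
    feed such a vector to the TTS model's conditioning layers."""
    x = intensity * np.cos(elevation) * np.cos(azimuth)
    y = intensity * np.cos(elevation) * np.sin(azimuth)
    z = intensity * np.sin(elevation)
    return np.array([x, y, z])

# Intensity control: the same style direction at two strengths.
mild = spherical_emotion_vector(0.3, azimuth=0.8, elevation=0.2)
strong = spherical_emotion_vector(0.9, azimuth=0.8, elevation=0.2)
```

Because the vector's norm equals the intensity, scaling the radius smoothly interpolates between neutral speech (the origin) and a fully expressed emotion while keeping the style direction fixed.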
Papers
VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech
Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, Pankaj Wasnik, Rajiv Ratn Shah
EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee