Prosody Modeling
Prosody modeling in speech synthesis aims to generate speech with natural intonation, rhythm, and stress, making synthetic voices more expressive and natural. Current research focuses on improving prosody control through techniques such as reinforcement learning, diffusion models, and hierarchical architectures that combine global and local prosodic features, often incorporating linguistic information such as syntax and phoneme-level details. These advances matter for producing more human-like synthetic speech in applications such as text-to-speech systems and voice assistants, and for expressive synthesis across languages and speakers. In addition, efficient automatic prosody annotation methods are being developed to reduce reliance on expensive manual labeling.
Papers
Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis
Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control
Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis