Prosodic Feature
Prosodic features, such as pitch, intensity, and rhythm, convey meaning and emotion beyond the literal words spoken. Current research focuses on accurately modeling and manipulating these features in applications such as speech synthesis, speech editing, and voice conversion, often using deep learning approaches such as diffusion models, variational autoencoders, and actor-critic reinforcement learning. This work matters for improving the naturalness and expressiveness of synthetic speech, enhancing accessibility for individuals with communication disorders, and advancing our understanding of human communication.
Papers
Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis
Konstantinos Klapsas, Nikolaos Ellinas, June Sig Sung, Hyoungmin Park, Spyros Raptis
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
Michael Hassid, Michelle Tadmor Ramanovich, Brendan Shillingford, Miaosen Wang, Ye Jia, Tal Remez