Speech Naturalness

Research on speech naturalness in text-to-speech (TTS) synthesis aims to generate synthetic speech that is indistinguishable from human speech, with emphasis on accurate prosody, timbre, and overall quality. Current work focuses on disentangling speech components (content, prosody, and timbre) using techniques such as factorized diffusion models and variational autoencoders (VAEs), often combined with large-scale datasets and billion-parameter models. These advances aim to improve the realism and emotional expressiveness of synthetic speech, with applications in virtual assistants, accessibility technologies, and entertainment.

Papers