Neural Speech Synthesis

Neural speech synthesis generates human-like speech from text, with research focused on improving naturalness, controllability, and efficiency. Current work emphasizes more robust models, including those that incorporate source-filter models, variational autoencoders, and diffusion probabilistic models, often paired with neural vocoders such as HiFi-GAN to produce high-fidelity audio. These advances matter for applications ranging from assistive technologies and multimedia production to forensic analysis and language preservation, particularly for low-resource languages. Research also addresses related challenges, such as detecting synthetic speech and improving speaker anonymization.

Papers