Neural Text to Speech

Neural text-to-speech (TTS) synthesizes natural-sounding human speech from text, with research focused on improving both audio quality and expressiveness. Recent work emphasizes end-to-end models, often built on diffusion processes or transformer-based architectures, that generate waveforms directly rather than through intermediate representations. It also explores methods to increase prosodic diversity and to control vocal effort so that speech remains intelligible in noisy environments. These advances matter for applications ranging from accessibility technologies to virtual assistants, where they improve both the realism and the usability of synthetic speech.

Papers