Non-Autoregressive Text-to-Speech

Non-autoregressive text-to-speech (TTS) synthesizes speech from text by generating all output frames in parallel rather than one step at a time, which makes inference substantially faster than in traditional autoregressive models. Current research focuses on improving the naturalness and speaker similarity of the generated speech, using techniques such as diffusion models, masked generative transformers, and variational autoencoders, often combined with speaker embeddings and probabilistic duration modeling for greater control and realism. These advances could enable more efficient and versatile speech synthesis, particularly in real-time systems and in applications that require diverse speaker voices.
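
To make the role of duration modeling concrete, the sketch below shows one common way a non-autoregressive model can align text with audio frames: each token encoding is repeated according to a predicted duration, so that a decoder can then process every frame at once. This is a minimal illustration in plain Python, not the method of any particular paper listed here. The function name `length_regulate`, the toy encodings, and the fixed integer durations are all assumptions made for the example; a probabilistic duration model would sample or predict these durations rather than hard-code them.

```python
from typing import List, Sequence


def length_regulate(
    encodings: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Expand token-level encodings to frame level.

    Each token encoding is repeated durations[i] times. The resulting
    frame sequence has a known length, so a decoder can generate all
    frames in parallel instead of one at a time.
    """
    if len(encodings) != len(durations):
        raise ValueError("exactly one duration per token is required")
    frames: List[List[float]] = []
    for vec, dur in zip(encodings, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not alias the same list object.
        frames.extend(list(vec) for _ in range(dur))
    return frames


# Hypothetical toy input: three tokens with 2-dimensional encodings.
token_encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Hypothetical durations in frames; a token with duration 0 is skipped.
predicted_durations = [2, 0, 3]

frames = length_regulate(token_encodings, predicted_durations)
print(len(frames))
```

Because the total number of frames is fixed by the durations before decoding starts, changing the durations (for example, scaling them all by a constant) is one simple way such a model could expose control over speaking rate.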

Papers