Zero-Shot Text-to-Speech
Zero-shot text-to-speech (TTS) synthesizes speech in the voice of an unseen speaker from only a short reference audio clip, eliminating the need for speaker-specific training data. Current research focuses on improving the naturalness, robustness, and efficiency of these systems through architectures such as diffusion models, flow-matching models, and large language models operating on discrete audio codes. These advances enable more accessible and versatile speech synthesis applications, including personalized voice assistants, audiobook narration, and assistive technologies for people with communication impairments. The field is also actively addressing challenges such as noise robustness and efficient inference for real-world deployment.
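The codec-language-model approach (exemplified by VALL-E 2 below) treats TTS as next-token prediction over discrete audio codes: the reference clip is encoded into codec tokens that serve as a prompt fixing the speaker identity, and the model generates the target speech's codes conditioned on that prompt and the input text. The following is a minimal Python sketch of that pipeline shape only; every function and constant here (encode_reference, phonemize, sample_continuation, decode_codes, the codebook size and frame rate) is a hypothetical stand-in, since a real system uses a trained neural codec and a trained autoregressive transformer rather than these random stubs.

```python
# Minimal sketch of a codec-language-model zero-shot TTS pipeline.
# All components are hypothetical stand-ins for trained models.
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 1024   # entries per codec codebook (assumed)
FRAME_RATE = 75        # codec frames per second (assumed)
SAMPLE_RATE = 24_000   # waveform sample rate (assumed)

def encode_reference(audio: np.ndarray) -> np.ndarray:
    """Stand-in for a neural codec encoder: waveform -> discrete codes."""
    n_frames = max(1, int(len(audio) / SAMPLE_RATE * FRAME_RATE))
    return rng.integers(0, CODEBOOK_SIZE, size=n_frames)

def phonemize(text: str) -> list[int]:
    """Stand-in grapheme-to-phoneme front end: text -> token ids."""
    return [ord(c) % 256 for c in text.lower()]

def sample_continuation(phonemes, prompt_codes, n_frames):
    """Stand-in autoregressive LM: given the phoneme sequence and the
    reference speaker's code prompt, sample codes for the target speech.
    A trained model would condition each step on all previous tokens."""
    codes = list(prompt_codes)
    for _ in range(n_frames):
        codes.append(int(rng.integers(0, CODEBOOK_SIZE)))  # ~ p(c_t | text, c_<t)
    return np.array(codes[len(prompt_codes):])

def decode_codes(codes: np.ndarray) -> np.ndarray:
    """Stand-in codec decoder: discrete codes -> waveform."""
    return rng.standard_normal(int(len(codes) / FRAME_RATE * SAMPLE_RATE))

# Zero-shot usage: a 3-second reference clip is the only speaker data.
reference = rng.standard_normal(3 * SAMPLE_RATE)
prompt = encode_reference(reference)      # speaker identity enters via the prompt
tokens = phonemize("hello from an unseen speaker")
codes = sample_continuation(tokens, prompt, n_frames=2 * FRAME_RATE)
waveform = decode_codes(codes)
print(waveform.shape)
```

The key design point the sketch illustrates is that speaker adaptation happens entirely at inference time, through prompting, rather than through any fine-tuning on the new speaker.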
Papers
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, Haizhou Li
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei