Zero-Shot Voice Conversion

Zero-shot voice conversion (ZSVC) aims to transform a source speaker's voice into that of an unseen target speaker while preserving the original speech content, without requiring paired training data. Current research focuses heavily on disentangling content from speaker characteristics within generative models such as variational autoencoders (VAEs), generative adversarial networks (GANs), and language models (LMs), often employing techniques such as clustering, cross-attention, and iterative refinement to improve speaker similarity and naturalness. These advances hold significant potential for personalized speech synthesis, voice anonymization, and accessibility technologies, while also pushing the boundaries of speech representation learning and generative modeling.
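
To make the content/speaker disentanglement recipe above concrete, the sketch below outlines a minimal encoder-decoder in PyTorch: a content encoder that extracts speaker-independent frame representations, a speaker encoder that embeds a short reference utterance from the unseen target speaker, and a decoder that fuses the two via cross-attention. All module and dimension names here are hypothetical, and the sketch omits the training objectives, pretrained content features, and vocoder that real ZSVC systems rely on.

```python
# Minimal sketch of the disentanglement idea behind many ZSVC systems
# (hypothetical module names and dimensions; not taken from any specific paper).
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps source speech frames to (ideally) speaker-independent content features."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4)

    def forward(self, mel):                   # mel: (B, T, n_mels)
        return self.encoder(self.proj(mel))   # (B, T, d_model)

class SpeakerEncoder(nn.Module):
    """Summarizes a short reference utterance into a single speaker embedding."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.net = nn.GRU(n_mels, d_model, batch_first=True)

    def forward(self, ref_mel):               # ref_mel: (B, T_ref, n_mels)
        _, h = self.net(ref_mel)
        return h[-1]                          # (B, d_model)

class Decoder(nn.Module):
    """Re-synthesizes mel frames from content, conditioned on the target speaker
    via cross-attention (the speaker embedding acts as the key/value memory)."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, content, spk):          # content: (B, T, d), spk: (B, d)
        memory = spk.unsqueeze(1)             # (B, 1, d)
        fused, _ = self.attn(content, memory, memory)
        return self.out(content + fused)      # (B, T, n_mels)

class ZeroShotVC(nn.Module):
    def __init__(self):
        super().__init__()
        self.content_enc = ContentEncoder()
        self.speaker_enc = SpeakerEncoder()
        self.decoder = Decoder()

    def forward(self, src_mel, ref_mel):
        content = self.content_enc(src_mel)   # what is said
        spk = self.speaker_enc(ref_mel)       # who should say it
        return self.decoder(content, spk)     # converted mel; vocoder not shown

# Example: convert a 200-frame source utterance toward an unseen reference speaker
model = ZeroShotVC()
converted = model(torch.randn(1, 200, 80), torch.randn(1, 150, 80))
print(converted.shape)  # torch.Size([1, 200, 80])
```

In practice, the disentanglement itself is enforced by the training setup (e.g. reconstruction with information bottlenecks, adversarial speaker classifiers, or discretized content tokens) rather than by the architecture alone; the sketch only shows how the two representations are separated and recombined.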

Papers