One-Shot Voice Conversion

One-shot voice conversion transforms a source speaker's utterance so that it sounds like a target speaker, given only a single, short audio sample of the target voice. Current research focuses on disentangling speaker identity from speech content, using generative adversarial networks, transformer-based architectures (including Conformer and Zipformer blocks), and vector quantization, often combined with contrastive learning and mutual information estimation to improve the learned representations. The field matters for applications such as personalized voice assistants, accessibility technologies for people with speech impairments, and creative audio manipulation, and it drives advances in speech representation and generative modeling.
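To make the disentanglement idea concrete, here is a toy Python sketch of one way vector quantization can separate content from speaker identity. It is an illustration under stated assumptions, not any specific paper's model: the features are random arrays standing in for encoder outputs, the codebook is random rather than learned, and the function names (`quantize`, `split_content_speaker`, `convert`) are hypothetical. The assumption is that the nearest codebook vector captures frame-level content, while the time-averaged residual serves as an utterance-level speaker vector.

```python
import numpy as np


def quantize(frames, codebook):
    """Map each frame in a (T, D) array to its nearest vector in a (K, D) codebook."""
    # Squared Euclidean distance between every frame and every code: shape (T, K).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx


def split_content_speaker(frames, codebook):
    """Split features into a content part and a speaker part (toy assumption)."""
    content, idx = quantize(frames, codebook)
    # Assumption: what quantization discards, averaged over time,
    # approximates a single speaker vector for the whole utterance.
    speaker = (frames - content).mean(axis=0)
    return content, speaker, idx


def convert(source_frames, target_frames, codebook):
    """Combine the source utterance's content with the target's speaker vector."""
    content, _, _ = split_content_speaker(source_frames, codebook)
    _, target_speaker, _ = split_content_speaker(target_frames, codebook)
    # Broadcasting adds the (D,) speaker vector to every (D,) content frame.
    return content + target_speaker


# Random stand-ins for encoder features: 20 source frames, 15 target frames.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
source = rng.normal(size=(20, 4))
target = rng.normal(size=(15, 4)) + 2.0  # a single short target sample

converted = convert(source, target, codebook)
```

In a real system the encoder, codebook, and decoder would be trained neural networks, and the converted features would be passed to a decoder or vocoder to produce audio. The sketch only shows why a single target utterance can be enough: the speaker vector is computed from that one sample, with no per-speaker training.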

Papers