Speaker Generation

Speaker generation aims to synthesize realistic-sounding speech in the voices of speakers who do not exist, with an emphasis on voices that are both diverse and controllable. Current research builds on pre-trained models such as text-to-speech systems, combining them with attribute interpolation (e.g., model merging, optimal transport) and with prompt-based control, which steers speaker characteristics from text descriptions. The field matters for entertainment, accessibility technologies, and data augmentation, while also raising challenges for deepfake detection and speaker de-identification.
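
As a rough illustration of the interpolation idea, the sketch below blends two speaker embeddings into a new one that belongs to neither source speaker. It is a toy example under stated assumptions, not the method of any particular paper: the 256-dimensional random vectors stand in for embeddings that a pre-trained speaker encoder would produce, the names `slerp`, `speaker_a`, `speaker_b`, and `new_speaker` are hypothetical, and spherical interpolation is only one simple choice among the interpolation techniques mentioned above.

```python
import numpy as np

def slerp(a, b, t):
    """Spherically interpolate between two embeddings (hypothetical helper).

    Both inputs are normalized to unit length first; t=0 returns the
    direction of a, t=1 the direction of b.
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Angle between the two unit vectors, clipped for numerical safety.
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < 1e-6:
        # Vectors are (almost) identical; nothing to interpolate.
        return a.copy()
    # Note: nearly opposite vectors (omega close to pi) are not handled here.
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Placeholder embeddings; a real system would obtain these from a speaker encoder.
rng = np.random.default_rng(0)
speaker_a = rng.standard_normal(256)
speaker_b = rng.standard_normal(256)

# A "nonexistent" speaker halfway between the two source voices.
new_speaker = slerp(speaker_a, speaker_b, 0.5)
```

In a full pipeline, the resulting vector would be passed to a text-to-speech model as its speaker conditioning; varying `t` or the pair of source speakers is one way to obtain diverse voices.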

Papers