Speaker Generation

Speaker generation aims to synthesize realistic-sounding speech in the voices of speakers who do not exist, with an emphasis on making those voices diverse and controllable. Current research builds on pre-trained models such as multi-speaker text-to-speech systems, and combines them with attribute interpolation (e.g., model merging, optimal-transport-based interpolation) and prompt-based control, which steers speaker characteristics from text descriptions. The field matters for entertainment, accessibility technologies, and data augmentation, and it also raises challenges for deepfake detection and speaker de-identification.

Papers

December 28, 2024

CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
Ji-Hoon Kim, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung
Human Language Speech Synthesis Speaker Generation

June 30, 2024

An Attribute Interpolation Method in Speech Synthesis by Model Merging
Masato Murata, Koichi Miyazaki, Tomoki Koriyama
Speech Synthesis Model Merging Speaker Generation

June 13, 2024

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems
Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian
Speech Synthesis Synthesized Speech Multi Speaker Text to Speech Impression Generation Speaker Generation Prompt Based Text to Speech

February 13, 2024

Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues
Maneesh Bilalpur, Mert Inan, Dorsa Zeinali, Jeffrey F. Cohn, Malihe Alikhani
Learning Abstract Financial Application Embodied AI Emotion Intensity Face to Face Speaker Generation Backchannel Prediction

October 8, 2023

PromptSpeaker: Speaker Generation Based on Text Descriptions
Yongmao Zhang, Guanghou Liu, Yi Lei, Yunlin Chen, Hao Yin, Lei Xie, Zhifei Li
Synthetic Voice Text Description Speaker Generation

June 2, 2023

Improved DeepFake Detection Using Whisper Features
Piotr Kawa, Marcin Plata, Michał Czuba, Piotr Szymański, Piotr Syga
Deepfake Detection State of the Art Whisper Deepfake Audio Whisper Model Speaker Generation

October 18, 2022

Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models
Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari
Optimal Transport Gaussian Mixture Model Speaker Embeddings Speaker Characteristic Speaker Independent Speaker Generation

September 9, 2022

DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion
Ruibin Yuan, Yuxuan Wu, Jacob Li, Jaxter Kim
Speaker Verification De Identification Zero Shot Voice Conversion Speaker Generation

March 31, 2022

HiFi-VC: High Quality ASR-Based Voice Conversion
A. Kashkin, I. Karpukhin, S. Shishkin
Speech Recognition Voice Conversion High Quality Speech System Speaker Generation

November 7, 2021

Speaker Generation
Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao
Transfer Learning Speaker Verification Synthetic Voice Speaker Similarity Speaker Generation