Voice Conversion
Voice conversion (VC) aims to transform one speaker's voice into that of another while preserving the original linguistic content. Current research focuses on improving the quality and naturalness of converted speech, particularly in challenging scenarios such as cross-lingual conversion and low-resource settings, often employing diffusion models, generative adversarial networks (GANs), and self-supervised learning within various encoder-decoder architectures. These advances are significant for applications ranging from personalized voice assistants and accessibility tools to protecting privacy in speech data and improving speech intelligibility assessment. The field is also actively addressing two open challenges: disentangling speaker identity from other speech characteristics, and mitigating vulnerabilities to deepfake attacks that misuse converted speech.
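The disentanglement idea underlying most of the encoder-decoder systems above can be sketched concretely. Below is a minimal, illustrative data-flow sketch (not from any of the listed papers): a content encoder extracts "what was said" from the source utterance, a speaker encoder extracts "who is speaking" from a target reference, and a decoder recombines them. All matrices (`W_content`, `W_speaker`, `W_decode`) are random stand-ins for what would be trained neural networks; the dimensions are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: mel-spectrogram bins, content dim, speaker-embedding dim.
D_FEAT, D_CONTENT, D_SPK = 80, 16, 8

# Stand-ins for trained networks (random linear maps, for illustration only).
W_content = rng.standard_normal((D_CONTENT, D_FEAT)) * 0.1
W_speaker = rng.standard_normal((D_SPK, D_FEAT)) * 0.1
W_decode = rng.standard_normal((D_FEAT, D_CONTENT + D_SPK)) * 0.1

def content_encoder(mel):
    # (T, D_FEAT) -> (T, D_CONTENT): frame-level linguistic content.
    return mel @ W_content.T

def speaker_encoder(mel):
    # (T, D_FEAT) -> (D_SPK,): utterance-level speaker identity (time-averaged).
    return (mel @ W_speaker.T).mean(axis=0)

def decoder(content, spk):
    # Recombine source content with the target speaker embedding,
    # broadcast across all frames, to produce converted mel frames.
    T = content.shape[0]
    z = np.concatenate([content, np.tile(spk, (T, 1))], axis=1)
    return z @ W_decode.T  # (T, D_FEAT)

source = rng.standard_normal((120, D_FEAT))  # source-speaker utterance (toy mels)
target = rng.standard_normal((200, D_FEAT))  # reference audio from target speaker

converted = decoder(content_encoder(source), speaker_encoder(target))
print(converted.shape)  # frame count follows the source, identity the target
```

Note how the converted output inherits its length (and, in a real system, its prosody and phonetic content) from the source, while speaker identity comes entirely from the target reference; any-to-any conversion follows because neither encoder is tied to speakers seen in training.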
Papers
Improving Voice Conversion for Dissimilar Speakers Using Perceptual Losses
Suhita Ghosh, Yamini Sinha, Ingo Siegert, Sebastian Stober
Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech
Dariusz Piotrowski, Renard Korzeniowski, Alessio Falai, Sebastian Cygert, Kamil Pokora, Georgi Tinchev, Ziyao Zhang, Kayoko Yanagisawa
HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods
Hyun-seo Shin, Jungwoo Heo, Ju-ho Kim, Chan-yeong Lim, Wonbin Kim, Ha-Jin Yu
Controllable Residual Speaker Representation for Voice Conversion
Le Xu, Jiangyan Yi, Jianhua Tao, Tao Wang, Yong Ren, Rongxiu Zhong
AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion
Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda
StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings
Arnab Das, Suhita Ghosh, Tim Polzehl, Sebastian Stober
Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion
Suhita Ghosh, Arnab Das, Yamini Sinha, Ingo Siegert, Tim Polzehl, Sebastian Stober
Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature
Kyungguen Byun, Sunkuk Moon, Erik Visser
Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data
Hyungseob Lim, Kyungguen Byun, Sunkuk Moon, Erik Visser