Paper ID: 2504.13791 • Published Apr 18, 2025
Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion
Sandipan Dhar, Md. Tousin Akhter, Nanda Dulal Jana, Swagatam Das
Following their significant success in image synthesis, Generative Adversarial
Network (GAN) models have made notable progress in speech synthesis, leveraging
their capacity to match the precise distribution of target data through
adversarial learning. Notably, even for State-Of-The-Art (SOTA) GAN-based Voice
Conversion (VC) models, a substantial disparity in naturalness remains between
real and GAN-generated speech samples. Furthermore, while many GAN models
currently operate with a single-generator, single-discriminator learning
approach, the target data distribution can be matched more effectively through
a single-generator, multi-discriminator learning scheme. Hence, this study
introduces a novel GAN model, the Collective Learning Mechanism-based Optimal
Transport GAN (CLOT-GAN), which incorporates multiple discriminators: a Deep
Convolutional Neural Network (DCNN), a Vision Transformer (ViT), and a
Conformer. The objective of integrating these diverse discriminators is to
capture the formant distribution of mel-spectrograms through a collective
learning mechanism. In parallel, an Optimal Transport (OT) loss, grounded in OT
theory, is included to precisely bridge the gap between the source and target
data distributions. Experimental validation on the VCC 2018, VCTK, and
CMU-Arctic datasets confirms that the CLOT-GAN-VC model outperforms existing VC
models in both objective and subjective assessments.
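
The abstract does not give the exact training objective, so the following is only a minimal PyTorch sketch of the general idea: a single generator trained against several discriminators whose adversarial losses are aggregated (a simple average stands in here for the collective learning mechanism), plus an entropic (Sinkhorn) OT penalty between generated and target mel-spectrogram batches. The names (`MelGenerator`, `make_disc`, `sinkhorn_ot`, `generator_step`), the small CNN stand-ins used in place of the DCNN/ViT/Conformer discriminators, the averaging rule, and the specific OT formulation are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): one generator, several
# discriminators with averaged adversarial feedback, and a Sinkhorn OT penalty
# between generated and target mel-spectrogram batches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelGenerator(nn.Module):
    """Toy mel-to-mel generator standing in for the VC generator."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, padding=2), nn.GELU(),
            nn.Conv1d(128, n_mels, 5, padding=2),
        )
    def forward(self, x):  # x: (batch, n_mels, frames)
        return self.net(x)

def make_disc(n_mels=80):
    """Small CNN stand-in; the paper instead uses DCNN, ViT, and Conformer discriminators."""
    return nn.Sequential(
        nn.Conv1d(n_mels, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
        nn.Conv1d(64, 1, 5, stride=2, padding=2),
    )

def sinkhorn_ot(x, y, eps=0.1, iters=50):
    """Entropic OT cost between two batches of flattened mel-spectrograms
    (uniform marginals, squared-Euclidean ground cost, log-domain Sinkhorn)."""
    x, y = x.flatten(1), y.flatten(1)
    C = torch.cdist(x, y) ** 2                      # (n, m) pairwise cost matrix
    n, m = C.shape
    mu = torch.full((n,), 1.0 / n, device=C.device)
    nu = torch.full((m,), 1.0 / m, device=C.device)
    u = torch.zeros(n, device=C.device)
    v = torch.zeros(m, device=C.device)
    for _ in range(iters):                          # alternating dual updates
        u = eps * (torch.log(mu) - torch.logsumexp((v[None, :] - C) / eps, dim=1))
        v = eps * (torch.log(nu) - torch.logsumexp((u[:, None] - C) / eps, dim=0))
    pi = torch.exp((u[:, None] + v[None, :] - C) / eps)  # approximate transport plan
    return (pi * C).sum()

def generator_step(gen, discs, src_mel, tgt_mel, lambda_ot=1.0):
    """One generator update: averaged ("collective") adversarial loss + OT loss.
    lambda_ot is a placeholder weight, not a value from the paper."""
    fake = gen(src_mel)
    adv = torch.stack([F.softplus(-d(fake)).mean() for d in discs]).mean()
    return adv + lambda_ot * sinkhorn_ot(fake, tgt_mel)

if __name__ == "__main__":
    gen = MelGenerator()
    discs = [make_disc() for _ in range(3)]         # DCNN / ViT / Conformer stand-ins
    src = torch.randn(4, 80, 128)                   # source-speaker mel batch
    tgt = torch.randn(4, 80, 128)                   # target-speaker mel batch
    loss = generator_step(gen, discs, src, tgt)
    loss.backward()
    print(float(loss))
```

In this sketch each discriminator sees the same generated mel-spectrogram but could embody a different inductive bias (convolutional, attention-based, convolution-augmented attention), which is the motivation the abstract gives for combining heterogeneous discriminators under a collective learning mechanism.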