Speech Representation Disentanglement

Speech representation disentanglement aims to separate the intertwined factors of speech, such as speaker identity, linguistic content, emotion, and acoustic characteristics, into independent representations. Current research centers on novel model architectures, including transformers, variational autoencoders (VAEs), and diffusion models, often combined with self-supervised learning and adversarial training to achieve effective disentanglement. By isolating and manipulating individual speech attributes, this work improves the robustness and quality of applications such as voice conversion, speech anonymization, and multi-talker speech recognition. The development of large-scale datasets designed specifically for evaluating disentanglement methods is another significant area of ongoing work.
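
As a rough illustration of the adversarial training idea mentioned above, the PyTorch sketch below pairs a frame-level content encoder with an utterance-level speaker encoder and attaches a speaker classifier to the content branch through a gradient-reversal layer, so the content embedding is penalized for retaining speaker identity. This is a minimal sketch under assumed settings, not the method of any specific paper; all names and dimensions (Disentangler, FEAT_DIM, lam, etc.) are illustrative.

```python
# Minimal adversarial disentanglement sketch (illustrative, not a published system).
import torch
import torch.nn as nn

FEAT_DIM, CONTENT_DIM, SPK_DIM, N_SPEAKERS = 80, 128, 64, 100  # assumed sizes


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class Disentangler(nn.Module):
    def __init__(self):
        super().__init__()
        # Frame-level content encoder and utterance-level speaker encoder.
        self.content_enc = nn.GRU(FEAT_DIM, CONTENT_DIM, batch_first=True)
        self.speaker_enc = nn.GRU(FEAT_DIM, SPK_DIM, batch_first=True)
        # Adversarial speaker classifier on the content branch.
        self.spk_adv = nn.Linear(CONTENT_DIM, N_SPEAKERS)

    def forward(self, mel, lam=1.0):
        content, _ = self.content_enc(mel)    # (B, T, CONTENT_DIM)
        _, spk_h = self.speaker_enc(mel)      # (1, B, SPK_DIM)
        spk_emb = spk_h[-1]                   # (B, SPK_DIM)
        # Reversed gradients push the content encoder to *remove* speaker
        # information while the classifier tries to recover it.
        adv_logits = self.spk_adv(GradReverse.apply(content.mean(dim=1), lam))
        return content, spk_emb, adv_logits


# Usage: the adversarial loss is plain cross-entropy on the reversed branch,
# added to whatever reconstruction or task loss the main model optimizes.
model = Disentangler()
mel = torch.randn(4, 200, FEAT_DIM)                 # batch of 4 mel-spectrograms
speaker_ids = torch.randint(0, N_SPEAKERS, (4,))
_, _, adv_logits = model(mel)
adv_loss = nn.functional.cross_entropy(adv_logits, speaker_ids)
adv_loss.backward()
```

The gradient-reversal layer is one common way to impose speaker invariance; published systems may instead use mutual-information penalties, vector quantization, or instance normalization to the same end.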

Papers