Paper ID: 2311.10804

A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness

Mathias Vogel

This report explores the challenge of enhancing expressiveness control in Text-to-Speech (TTS) models by augmenting a frozen pretrained model with a Diffusion Model that is conditioned on joint semantic audio/text embeddings. The paper identifies the challenges encountered when working with a VAE-based TTS model and evaluates different image-to-image methods for altering latent speech features. Our results offer valuable insights into the complexities of adding expressiveness control to TTS systems and open avenues for future research in this direction.

Submitted: Nov 17, 2023

Topics

Latent Space
Study Feature
Text to Speech
Speech Model
Text to Speech Model
Latent Speech
Expressiveness
