Intermediate Speech Representation
Intermediate speech representations capture the essential acoustic information in a speech signal and underpin advanced speech processing tasks such as text-to-speech synthesis and speaker recognition. Current research aims to develop more effective representations, moving beyond traditional features such as mel-spectrograms toward learned, high-dimensional embeddings from models such as VQ-GANs and Wav2Vec 2.0, used either within end-to-end architectures or in two-stage pipelines. These representations improve speech synthesis quality and robustness to noise, and they enable new capabilities such as speaker anonymization and zero-shot synthesis, while also raising concerns about security vulnerabilities related to data privacy.