FastSpeech2 Architecture

FastSpeech2 is a neural text-to-speech (TTS) model that generates high-quality, natural-sounding speech from text input. Current research focuses on improving it in three ways: integrating self-supervised speech representations to capture richer speech characteristics, adding emotional expression through conditioning mechanisms, and training jointly with vocoders such as HiFi-GAN to streamline the pipeline and improve synthesis quality. These advances matter for accessibility (e.g., for visually impaired individuals) and for more expressive, human-like synthetic speech across applications.

Papers

July 19, 2024

Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2
Chun Xu, En-Wei Sun
Keywords: Single CLIP, Audio Generation, Synchronous Generator, High Quality Speech, Braille Letter Reading, Fastspeech2 Architecture, Speech to Image

August 2, 2023

SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis
Ramanan Sivaguru, Vasista Sai Lodagala, S Umesh
Keywords: Synthesized Speech, Speech Quality, Self Supervised Speech Representation, Text to Speech Synthesis, Fastspeech2 Architecture

June 28, 2023

EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech
Daria Diatlova, Vitaly Shutov
Keywords: Speech Analysis, Speech Synthesis, Underlying Emotion, Speech to Text, Emotional Speech, Emotion Model, Fastspeech2 Architecture

March 31, 2022

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech
Dan Lim, Sunghee Jung, Eesung Kim
Keywords: Speech Analysis, Neural Vocoder, Acoustic Feature, Annotated End State, HiFi GAN, Speech Text Alignment, Multiple Jet, Fastspeech2 Architecture
