Text to Speech Synthesis

Text-to-speech (TTS) synthesis converts written text into natural-sounding speech, and research in the area targets both the quality of the generated audio and the efficiency of generating it. Current work emphasizes faster, more lightweight models, often built on diffusion models, autoregressive methods, and transformer architectures, and explores techniques such as post-training quantization to reduce computational demands. These advances matter for extending speech technology to more languages and to resource-constrained environments, with applications ranging from accessibility tools to personalized communication systems.

Papers

October 2, 2023

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation
Roi Benita, Michael Elad, Joseph Keshet
Language Generation Speech Generation High Fidelity Vocoder Text to Speech Synthesis Speech Waveform AutoRegressive Diffusion

September 21, 2023

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng
Language Understanding Speech Encoder Zero Shot Text to Speech Text to Speech Synthesis Zero Shot Speaker Adaptation Acoustic Prompt

August 2, 2023

SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis
Ramanan Sivaguru, Vasista Sai Lodagala, S Umesh
Synthesized Speech Speech Quality Self Supervised Speech Representation Text to Speech Synthesis FastSpeech2 Architecture

June 16, 2023

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models
Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley
Unlabeled Video Modality Gap Text to Speech Synthesis Audio Visual Correspondence Audio Driven Visual Synthesis

May 21, 2023

VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages
Shivam Mhaskar, Vineet Bhat, Akshay Batheja, Sourabh Deoghare, Paramveer Choudhary, Pushpak Bhattacharyya
Machine Translation Indian Language Text to Speech Synthesis Text Machine Translation

March 7, 2023

Do Prosody Transfer Models Transfer Prosody?
Atli Thor Sigurgeirsson, Simon King
Synthesized Speech Speech Generation Text to Speech Synthesis

March 1, 2023

ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations
Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha Sahipjohn, Niranjan Pedanekar, Vineet Gandhi
Indian Language Self Supervised Speech Representation Multilingual Scenario Text to Speech Synthesis Multilingual TTS

December 16, 2022

Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder
Yusuke Yasuda, Tomoki Toda
Variational Autoencoder Diffusion Explainer Probabilistic Model Text to Speech Synthesis Latent Sequence

December 15, 2022

RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis
Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh
Text to Speech Semantic Information Text to Speech Synthesis Relation Aware Neural Text to Speech

October 26, 2022

Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
Text to Speech Speech Data Synthesized Speech Data Selection Speech Corpus Text to Speech Model Text to Speech Synthesis

April 3, 2022

Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng
Fine Grained Speaker Embeddings Speech Encoder Speaker Similarity Text to Speech Synthesis Zero Shot Speaker Adaptation

April 2, 2022

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu
Self Supervised Neural Vocoder High Fidelity Vocoder Text to Speech Synthesis Current TTS System Neural Text to Speech

March 21, 2022

AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling
Bac Nguyen, Fabien Cardinaux, Stefan Uhlich
End to End Speech Synthesis Text to Speech Synthesis Duration Modelling
