Text to Speech

Text-to-speech (TTS) research aims to synthesize natural-sounding human speech from textual input, focusing on improving speech quality, speaker similarity, and efficiency. Current efforts concentrate on developing advanced architectures like diffusion models and transformers, often incorporating techniques such as flow matching and semantic communication to enhance both the naturalness and expressiveness of generated speech. This field is crucial for applications ranging from assistive technologies and accessibility tools to combating deepfakes and creating more realistic synthetic datasets for training other AI models.

Papers

July 4, 2022

Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)
Ziyao Zhang, Alessio Falai, Ariadna Sanchez, Orazio Angelini, Kayoko Yanagisawa
Empirical Study Text to Speech Speech Synthesis Training Corpus Based Mix Monolingual Corpus

July 3, 2022

DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech
Keon Lee, Kyumin Park, Daeyoung Kim
Text to Speech Open Domain Dialogue Dialogue Datasets Multi Speaker Tt Speech System

July 1, 2022

June 30, 2022

June 29, 2022

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody
Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou
Text to Speech Prosodic Feature Expressive Speech

June 27, 2022

Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding
Wei-Ping Huang, Po-Chun Chen, Sung-Feng Huang, Hung-yi Lee
Transfer Learning Text to Speech Speech to Text Text to Speech Model Shot Training Cross Lingual Text to Speech

June 24, 2022

June 16, 2022

Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, Guangzhi Li, Deng Cai, Dong Yu
Text to Speech Prosodic Feature Pre Trained Speech Model Prosody Modeling

June 6, 2022

Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE
Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu
Text to Speech Speech Synthesis Speech Representation Self Supervised Speech Representation Learning

June 5, 2022

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye
End to End Text to Speech Speech to Text Semantic Prior Geographic Feature Pronunciation Polyphone Disambiguation

May 30, 2022

May 24, 2022

TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
Text to Speech Multi Speaker Text to Speech Personalized Speech Low Resource Text to Speech

May 9, 2022

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech
Yang Li, Cheng Yu, Guangzhi Sun, Hua Jiang, Fanglei Sun, Weiqin Zu, Ying Wen, Yang Yang, Jun Wang
Text to Speech Prosodic Feature Lung VAE Cross Utterance Non Autoregressive Text to Speech Utterance Information

May 6, 2022

Hearing voices at the National Library -- a speech corpus and acoustic model for the Swedish language
Martin Malmsten, Chris Haffenden, Love Börjeson
Automatic Speech Recognition Speech Recognition Text to Speech Easy to Use Library Human VOICE Speech Corpus Acoustic Model

April 14, 2022

Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech
Cong Zhang, Huinan Zeng, Huang Liu, Jiewen Zheng
Text to Speech Code Switched Phonological Feature Non Native

April 11, 2022

Fine-grained Noise Control for Multispeaker Speech Synthesis
Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis
Text to Speech Speech Synthesis Utterance Level Fine Grained Control Prosody Modeling Noise Representation Fine Grained Prosody

Text to Speech

Papers

Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)

DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech

Building African Voices

Swiss German Speech to Text system evaluation

R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding

Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data

TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

Hearing voices at the National Library -- a speech corpus and acoustic model for the Swedish language

Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech

Fine-grained Noise Control for Multispeaker Speech Synthesis