Text to Speech

Text-to-speech (TTS) research aims to synthesize natural-sounding human speech from textual input, focusing on improving speech quality, speaker similarity, and efficiency. Current efforts concentrate on developing advanced architectures like diffusion models and transformers, often incorporating techniques such as flow matching and semantic communication to enhance both the naturalness and expressiveness of generated speech. This field is crucial for applications ranging from assistive technologies and accessibility tools to combating deepfakes and creating more realistic synthetic datasets for training other AI models.

Papers

September 6, 2023

September 5, 2023

PromptTTS 2: Describing and Generating Voices with Text Prompt
Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian
Text to Speech Speech Generation Speech Datasets

September 4, 2023

A Comparative Analysis of Pretrained Language Models for Text-to-Speech
Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman
Language Model Comparative Study Text to Speech Pretrained Language Model Natural Sounding Speech Prosody Prediction

September 2, 2023

DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin
Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie
Diffusion Model Text to Speech Chinese Character Target Language Multilingual Speech Cross Lingual Emotion Cross Lingual Text to Speech

August 31, 2023

LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech
Jie Chen, Xingchen Song, Zhendong Peng, Binbin Zhang, Fuping Pan, Zhiyong Wu
Text to Speech Diffusion Probabilistic Model Inference Latency Text to Speech Model Diffusion Decoder Streaming Inference

August 28, 2023

August 12, 2023

Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation
Zhichao Wang, Mengyu Dai, Keld Lundgaard
Zero Shot Text to Speech Text to Video Text to Video Generation Two Stage Talking Head Video Based Person Re Identification

July 31, 2023

Multilingual context-based pronunciation learning for Text-to-Speech
Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime Lorenzo-Trueba
Text to Speech Multilingual Model Speech to Text Phonetic Information Grapheme to Phoneme Contextual Adapter

July 29, 2023

\`{I}r\`{o}y\`{i}nSpeech: A multi-purpose Yor\`{u}b\'{a} Speech Corpus
Tolulope Ogunremi, Kola Tubosun, Anuoluwapo Aremu, Iroro Orife, David Ifeoluwa Adelani
Automatic Speech Recognition Text to Speech Speech Data Speech Corpus

July 28, 2023

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding
Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
Language Model Comparative Study Text to Speech Conditional Diffusion Model Semantic Encoding Discrete Speech Representation Supervised Text to Speech

July 13, 2023

Controllable Emphasis with zero data for text-to-speech
Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova
Raw Data Text to Speech Special Emphasis Phoneme Duration Emphasis Detection

July 11, 2023

On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis
Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely
Text to Speech Speech Synthesis Greater Public Use Speech Representation Synthesized Speech Self Supervised Speech Representation Mean Opinion Score Spontaneous Speech Synthesis

June 27, 2023

GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech
Yahuan Cong, Haoyu Zhang, Haopeng Lin, Shichao Liu, Chunfeng Wang, Yi Ren, Xiang Yin, Zejun Ma
Text to Speech Speech Synthesis Better Generalization Speech Representation Disentanglement Multilingual Speech Timbre Perception Speaking Style

June 25, 2023

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu
Text to Speech Speech Synthesis High Fidelity Vocoder High Fidelity Speech

June 16, 2023

CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages
Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Anderson S. Soares, Arlindo R. Galvão Filho
Low Resource Language Text to Speech Speech Synthesis Multilingual Dataset Librispeech Speech Recognition Controllable Text to Speech

June 14, 2023

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation
Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu, Xie Chen
Data Augmentation Automatic Speech Recognition Entity Recognition Text to Speech Code Switching Speech Editing Text Based Speech Editing

June 13, 2023