Text to Speech

Text-to-speech (TTS) research aims to synthesize natural-sounding human speech from text, with a focus on speech quality, speaker similarity, and efficiency. Current work centers on architectures such as diffusion models and transformers, often combined with techniques such as flow matching and semantic communication to make generated speech more natural and expressive. Applications range from assistive technologies and accessibility tools to combating deepfakes and building more realistic synthetic datasets for training other AI models.

Papers

July 7, 2024

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan
Text to Speech, Zero Shot Text to Speech, Semantic Token, Speech Token

July 4, 2024

On the Effectiveness of Acoustic BPE in Decoder-Only TTS
Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu
Text to Speech

July 2, 2024

Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
Yuchen Hu, Chen Chen, Siyin Wang, Eng Siong Chng, Chao Zhang
Zero Shot Text to Speech, Efficient Inference, Reverse Inference Optimization

June 27, 2024

DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability
Hyun Joon Park, Jin Sob Kim, Wooseok Shin, Sung Won Han
Text to Speech, Speech Synthesis, Temporal Variation, Diffusion Based Text, Expressive Text to Speech

June 26, 2024

June 25, 2024

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation
Yingting Li, Ambuj Mehrish, Bryan Chew, Bo Cheng, Soujanya Poria
Text to Speech, Speech Synthesis, Multilingual Dataset, Parameter Efficient Transfer Learning

June 21, 2024

InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions
Yu Nakagome, Michael Hentschel
Text to Speech, CTC Based End to End Speech Recognition, Unseen Speaker

June 16, 2024

Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech
Guan-Ting Lin, Wei-Ping Huang, Hung-yi Lee
Speech Recognition, Text to Speech, Noisy Speech, Continual Test Time Adaptation, Supervised Text to Speech

June 15, 2024

GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis
Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan
Text to Speech, High Quality Speech, Expressive Speech Synthesis, Glottal Source

June 13, 2024

June 12, 2024

June 11, 2024

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?
Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng
Text to Speech, Speech Data, Speech to Speech Translation, Direct Speech to Speech Translation

June 10, 2024

June 8, 2024

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter
Text to Speech, Text to Speech Model, Spontaneous Speech, Non Autoregressive Text to Speech, Duration Modelling

June 7, 2024

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber
Text to Speech, Zero Shot Text to Speech, Voice Cloning, Multilingual Training

Papers

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

On the Effectiveness of Acoustic BPE in Decoder-Only TTS

Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

LLM-Driven Multimodal Opinion Expression Identification

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions

Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech

GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage

DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Controlling Emotion in Text-to-Speech with Natural Language Prompts

MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model