Text to Speech Model

Text-to-speech (TTS) models synthesize natural-sounding human speech from text input. Current research focuses on improving both the quality and the controllability of the generated audio: it refines model architectures such as Transformers and diffusion models, and it incorporates techniques such as preference alignment, adversarial training, and hierarchical acoustic modeling to achieve higher fidelity, speaker consistency, and emotional expressiveness. These advances matter for applications ranging from accessibility tools for visually impaired users to personalized voice assistants and improved synthetic data generation for other AI tasks.

Papers

February 2, 2024

Natural language guidance of high-fidelity text-to-speech with synthetic annotations
Dan Lyth, Simon King
Natural Language Instruction Speaker Identity Speech Generation Text to Speech Model Speech Language Model Synthetic Annotation

January 24, 2024

Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages
Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro
Text to Speech Generative Adversarial Indian Language Text to Speech Model Zero Shot Text to Speech Speaker Information State of the Art NVIDIA

December 21, 2023

EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
Maureen de Seyssel, Antony D'Avirro, Adina Williams, Emmanuel Dupoux
Prosodic Feature Text to Speech Model Speech to Speech Translation Special Emphasis Speech Resynthesis Emphasis Detection

November 17, 2023

A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness
Mathias Vogel
Latent Space Study Feature Text to Speech Speech Model Text to Speech Model Latent Speech 1 WL Expressiveness

November 2, 2023

E3 TTS: Easy End-to-End Diffusion-based Text to Speech
Yuan Gao, Nobuyuki Morioka, Yu Zhang, Nanxin Chen
Text Modality Speech Analysis Text to Speech Text to Speech Model Latent Structure High Fidelity Audio Neural Text to Speech

October 22, 2023

An overview of text-to-speech systems and media applications
Mohammad Reza Hasanabadi
Text to Speech System Description Speech to Text Text to Speech Model Synthetic Voice

October 13, 2023

SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg
Context Learning Speech Recognition Speech Translation View Translation Text to Speech Model Augmented Language Model Zero Shot in Context

October 8, 2023

Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset
Ze Liu
Transfer Learning Comparative Study Text to Speech Low Resource Text to Speech Model

September 29, 2023

ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech
Wenhao Guan, Qi Su, Haodong Zhou, Shiyu Miao, Xingjia Xie, Lin Li, Qingyang Hong
Speech Synthesis Text to Speech Model Rectified Flow Speech Resynthesis

September 15, 2023

Fewer-token Neural Speech Codec with Time-invariant Codes
Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chuyuan Zhang, Junzuo Zhou
Text to Speech Text to Speech Model Zero Shot Text to Speech Neural Speech

August 31, 2023

LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech
Jie Chen, Xingchen Song, Zhendong Peng, Binbin Zhang, Fuping Pan, Zhiyong Wu
Text to Speech Diffusion Probabilistic Model Inference Latency Text to Speech Model Diffusion Decoder Streaming Inference

August 28, 2023

Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
Hyungchan Yoon, Changhwan Kim, Eunwoo Song, Hyun-Wook Yoon, Hong-Goo Kang
Text to Speech Text to Speech Model Personalized Speech

July 10, 2023

The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task
Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, Guoqing Zhao
Automatic Speech Recognition Model Multi Source Text to Speech Model Speech to Speech Translation Speech Translation Task

June 1, 2023

SlothSpeech: Denial-of-service Attack Against Speech Recognition Models
Mirazul Haque, Rutvij Shah, Simin Chen, Berrak Şişman, Cong Liu, Wei Yang
Automatic Speech Recognition Automatic Speech Recognition Model Text to Speech Model Speech Recognition Model Denial of Service Attack

May 31, 2023

Text-to-Speech Pipeline for Swiss German -- A comparison
Tobias Bollinger, Jan Deriu, Manfred Vogel
Consistent Comparison Text to Speech Speech Synthesis Generative Adversarial Text to Speech Model Swiss German

May 28, 2023

Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS
Sewade Ogun, Vincent Colotte, Emmanuel Vincent
Speech Analysis Text to Speech Diversity Awareness Text to Speech Model Visual Naturalness Stochastic Pitch Prediction

May 25, 2023

Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration
Rustem Yeshpanov, Saida Mussakhojayeva, Yerbolat Khassanov
End to End Speech Synthesis Text to Speech Model Turkish Text Cyrillic Latin Script Transliteration Mongolian Text to Speech

May 18, 2023

A unified front-end framework for English text-to-speech synthesis
Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, Yuanyuan Huo, Yuxuan Wang
Text to Speech Critical Synthesis Text to Speech Model Unifying Framework

April 25, 2023

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge
Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu
Challenge Task Speaker Embeddings Language Representation Text to Speech Model High Fidelity Vocoder

March 29, 2023

AraSpot: Arabic Spoken Command Spotting
Mahmoud Salhab, Haidar Harmanani
Data Augmentation Text to Speech Model Spoken Language Arabic Word