Text to Speech

Text-to-speech (TTS) research aims to synthesize natural-sounding human speech from textual input, focusing on improving speech quality, speaker similarity, and efficiency. Current efforts concentrate on developing advanced architectures like diffusion models and transformers, often incorporating techniques such as flow matching and semantic communication to enhance both the naturalness and expressiveness of generated speech. This field is crucial for applications ranging from assistive technologies and accessibility tools to combating deepfakes and creating more realistic synthetic datasets for training other AI models.

Papers

March 17, 2024

Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations
Claudio Pinhanez, Raul Fernandez, Marcelo Grave, Julio Nogima, Ron Hoory
Technical Challenge Text to Speech Synthetic Voice State Aware Guideline African American

March 13, 2024

EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech
Ziqi Liang, Haoxiang Shi, Jiawei Wang, Keda Lu
End to End Text to Speech Speech Synthesis Text to Speech Model Mongolian Text to Speech

March 9, 2024

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen
Full Model Text to Speech Synthesized Speech Text to Speech Model Zero Shot Text to Speech Zero Shot Voice Conversion Data Scaling

March 7, 2024

Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation
Sai Akarsh, Vamshi Raghusimha, Anindita Mondal, Anil Vuppala
Text to Speech Manual Effort Stress Detection Model Stress Distribution Stress Annotation

March 5, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
Diffusion Model Text to Speech Speech Synthesis Natural Sounding Speech Speech Representation Disentanglement Vec Tok Codec Speech Naturalness

February 29, 2024

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov
Text to Speech Speech Synthesis Unknown Language Speech Encoder Multilingual Speech Unlabeled Speech

February 26, 2024

An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation
Ahmet Gunduz, Kamer Ali Yuksel, Kareem Darwish, Golara Javadi, Fabio Minazzi, Nicola Sobieski, Sebastien Bratieres
Text to Speech High Quality Text to Speech Model

February 22, 2024

Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition
Rendi Chevi, Alham Fikri Aji
Text to Speech Prosodic Feature Experienced Emotion Microbial Decomposition Speech Naturalness Prosody Encoder

February 19, 2024

On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models
Miri Varshavsky-Hassid, Roy Hirsch, Regev Cohen, Tomer Golany, Daniel Freedman, Ehud Rivlin
Latent Space Text to Speech Synthesized Speech Denoising Diffusion Model Diffusion Based Text

February 12, 2024

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman
Raw Data Text to Speech Critical Lesson Text to Speech Model Billion Parameter Speech Naturalness

February 9, 2024

A New Approach to Voice Authenticity
Nicolas M. Müller, Piotr Kawa, Shen Hu, Matthias Neu, Jennifer Williams, Philip Sperl, Konstantin Böttinger
Text to Speech Novel Approach Fake Speech Audio Recording Spoken Utterance

February 8, 2024

Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, Sungroh Yoon, Kang Min Yoo
Language Model Text to Speech Dialog Model Joint Speech Text Speech Text

February 5, 2024

Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations
Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Iván Vallés-Pérez, Biel Tura-Vecino, Piotr Biliński, Mateusz Lajszczak, Grzegorz Beringer, Roberto Barra-Chicote, Jaime Lorenzo-Trueba
Text to Speech Core Stability Voice Conversion Disentangled Representation

February 1, 2024

Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech
Dong Yang, Tomoki Koriyama, Yuki Saito
Text to Speech Environment Exploration Self Training Exhaled Breath Breath Detection

January 28, 2024

MunTTS: A Text-to-Speech System for Mundari
Varun Gumma, Rishav Hada, Aditya Yadavalli, Pamir Gogoi, Ishani Mondal, Vivek Seshadri, Kalika Bali
Text to Speech Speech Synthesis Speech Model Language Technology Low Resource Indian Language Text to Speech System

January 25, 2024

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

Intelli-Z: Toward Intelligible Zero-Shot TTS

January 24, 2024

Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages
Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro
Text to Speech Generative Adversarial Indian Language Text to Speech Model Zero Shot Text to Speech Speaker Information State of the Art NVIDIA

January 23, 2024

Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization
Wei-Ping Huang, Sung-Feng Huang, Hung-yi Lee
Text to Speech New Initialization Language Adaptation Data Efficiency Effective Transfer Learning Domain Mixing Text to Speech System

January 22, 2024

Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis
Vinotha R, Hepsiba D, L. D. Vijay Anand, Deepak John Reji
Text to Speech Speech Synthesis Timely Communication Synthesized Speech Speech Technology Voice Cloning