Text to Speech

Text-to-speech (TTS) research aims to synthesize natural-sounding human speech from text, with a focus on speech quality, speaker similarity, and efficiency. Current work centers on architectures such as diffusion models and transformers, often combined with techniques such as flow matching and semantic communication to improve the naturalness and expressiveness of generated speech. The field supports applications ranging from assistive technologies and accessibility tools to efforts to combat deepfakes and the creation of realistic synthetic datasets for training other AI models.

Papers

October 17, 2024

DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis
Yu Gu, Qiushi Zhu, Guangzhi Lei, Chao Weng, Dan Su
Variational Autoencoder Adversarial Learning Text to Speech Adaptive Importance Temporal Attention Text to Speech Model Text to Speech Synthesis

October 16, 2024

Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR
Christoph Minixhofer, Ondrej Klejch, Peter Bell
Automatic Speech Recognition Text to Speech Denoising Diffusion Probabilistic Model Critical Synthesis ASR Model ASR System Mean Squared Error

October 12, 2024

Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling
Rui Liu, Zhenqi Jia, Jie Yang, Yifan Hu, Haizhou Li
Text to Speech Conversational Dataset Interactive Rendering Multi Scale Contextual Information Dense Annotation Text Rendering Emphasis Detection

October 9, 2024

Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS
Onkar Kishor Susladkar, Vishesh Tripathi, Biddwan Ahmed
Text to Speech Synthesized Speech

October 7, 2024

SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech
Minchan Kim, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim
Text to Speech Implicit Neural Representation Text Sequence Sequence Alignment

October 6, 2024

October 4, 2024

Generative Semantic Communication for Text-to-Speech Synthesis
Jiahao Zheng, Jinke Ren, Peng Xu, Zhihao Yuan, Jie Xu, Fangxin Wang, Gui Gui, Shuguang Cui
Text to Speech Semantic Communication Text to Speech Synthesis Generative Semantic Communication

September 28, 2024

FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency
Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li
Text to Speech Prosodic Feature Text Editing Speech Editing

September 25, 2024

Exploring synthetic data for cross-speaker style transfer in style representation based TTS
Lucas H. Ueda, Leonardo B. de M. M. Marques, Flávio O. Simões, Mário U. Neto, Fernando Runstein, Bianca Dal Bó, Paula D. P. Costa
Synthetic Data Style Transfer Text to Speech Voice Conversion Style Representation Accent Transfer

September 20, 2024

Zero-shot Cross-lingual Voice Transfer for TTS
Fadi Biadsy, Youzheng Chen, Isaac Elias, Kyle Kastner, Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran
Zero Shot Text to Speech Speech Restoration

September 18, 2024

DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li
Diffusion Model Text to Speech Diffusion Transformer Temporal Modeling

September 15, 2024

Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning
Siqi Sun, Korin Richmond
Multi Task Learning Text to Speech

September 14, 2024

E1 TTS: Simple and Fast Non-Autoregressive TTS
Zhijun Liu, Shuai Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li
Text to Speech Speaker Similarity Audio Sample Non Autoregressive Text to Speech

September 13, 2024

September 10, 2024

Prosodic Parameter Manipulation in TTS generated speech for Controlled Speech Generation
Podakanti Satyajith Chary
Speech Analysis Text to Speech Speech Generation Prosody Control

September 8, 2024

Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion
Zhengyang Chen, Shuai Wang, Mingyang Zhang, Xuechen Liu, Junichi Yamagishi, Yanmin Qian
Context Learning Pre Trained Model Text to Speech Prosodic Feature Semantic Information Emotional Speech Source Speech Zero Shot Voice Conversion

September 4, 2024

Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems
Jeongmin Liu, Eunwoo Song
Text to Speech Content Based Feature High Fidelity Vocoder

September 3, 2024

VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka
Li-Wei Chen, Hung-Shin Lee, Chen-Chi Chang
Speech Recognition Text to Speech Speech Synthesis Chinese Character Mandarin Speech

Text to Speech

Papers

HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis

SONAR: A Synthetic AI-Audio Detection Framework and Benchmark

DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset

Text-To-Speech Synthesis In The Wild
