Neural Text to Speech

Neural text-to-speech (TTS) synthesizes natural-sounding human speech from text, with the goal of improving both audio quality and expressiveness. Recent research emphasizes end-to-end models, often built on diffusion processes or transformer-based architectures, that generate waveforms directly rather than through intermediate representations such as spectrogram features. Other work explores ways to increase prosodic diversity and to control vocal effort so that synthetic speech stays intelligible in noisy environments. These advances matter for applications ranging from accessibility technologies to virtual assistants, where they improve both the realism and the usability of synthetic speech.

Papers

November 2, 2023

E3 TTS: Easy End-to-End Diffusion-based Text to Speech
Yuan Gao, Nobuyuki Morioka, Yu Zhang, Nanxin Chen
Text Modality, Speech Analysis, Text to Speech, Text to Speech Model, Latent Structure, High Fidelity Audio, Neural Text to Speech

October 23, 2023

DPP-TTS: Diversifying prosodic features of speech via determinantal point processes
Seongho Joo, Hyukhun Koh, Kyomin Jung
Speech Analysis, Prosodic Feature, Determinantal Point Process, Speech Segment, Neural Text to Speech, Current TTS System

May 23, 2023

EfficientSpeech: An On-Device Text to Speech Model
Rowel Atienza
Text to Speech, Speech Model, Pyramid Transformer, Device Use Case, Neural Text to Speech

December 15, 2022

RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis
Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh
Text to Speech, Semantic Information, Text to Speech Synthesis, Relation Aware, Neural Text to Speech

November 1, 2022

Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features
Alexandra Vioni, Georgia Maniati, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis
Prosodic Feature, Linguistic Feature, Naturalness Assessment, Neural TTS, Speech Naturalness, Neural Text to Speech

April 2, 2022

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu
Self Supervised, Neural Vocoder, High Fidelity Vocoder, Text to Speech Synthesis, Neural Text to Speech, Current TTS System

March 20, 2022

Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise
Tuomo Raitio, Petko Petkov, Jiangchuan Li, Muhammed Shifas, Andrea Davis, Yannis Stylianou
Text to Speech, Industrial Disturbing Noise, Synthesized Speech, Speech Intelligibility, Vocal Expression, Neural Text to Speech
