Synthesized Speech
Synthesized speech research focuses on creating realistic, natural-sounding artificial speech for applications such as voice assistants, audiobooks, and accessibility tools. Current work concentrates on improving the naturalness and expressiveness of synthesized speech, typically with deep learning models such as GANs, diffusion models, and transformers. A complementary line of work addresses detecting synthetic speech (audio deepfakes) and mitigating biases in those detection systems. The field is central to advancing human-computer interaction, improving accessibility technologies, and combating the malicious use of synthetic audio in fraud and disinformation.
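To make the detection side of this concrete, here is a toy sketch of the general shape of a synthetic-speech detection pipeline: extract an acoustic feature from the waveform, then apply a decision rule. The feature used here (spectral flatness, a classic DSP measure) and the fixed threshold are illustrative assumptions only; the systems the text refers to learn their features and classifiers with deep networks rather than hand-crafting them.

```python
import cmath
import math
import random

def power_spectrum(signal):
    """Naive DFT power spectrum (adequate for short toy signals)."""
    n = len(signal)
    spec = []
    for k in range(n // 2):
        s = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        spec.append(abs(s) ** 2 + 1e-12)  # small floor avoids log(0)
    return spec

def spectral_flatness(signal):
    """Geometric mean / arithmetic mean of the power spectrum.
    Close to 1.0 for noise-like signals, close to 0.0 for tonal ones."""
    spec = power_spectrum(signal)
    log_mean = sum(math.log(p) for p in spec) / len(spec)
    return math.exp(log_mean) / (sum(spec) / len(spec))

# Toy stand-ins for two signal classes: a pure tone (energy
# concentrated in one frequency bin) versus broadband noise
# (energy spread across bins).
tone = [math.sin(2 * math.pi * 5 * t / 256) for t in range(256)]
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(256)]

# A (hypothetical) threshold rule: feature extraction -> decision.
def looks_tonal(signal, threshold=0.1):
    return spectral_flatness(signal) < threshold
```

The point is the two-stage structure (feature, then decision), not the feature itself: a real detector replaces both stages with a trained model operating on spectrogram or raw-waveform input.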
Papers
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis
Konstantinos Klapsas, Karolos Nikitaras, Nikolaos Ellinas, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis
Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition
Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda