Synthesized Speech

Synthesized speech research focuses on creating realistic, natural-sounding artificial speech, primarily for applications like voice assistants, audiobooks, and accessibility tools. Current efforts concentrate on improving the naturalness and expressiveness of synthesized speech, often using deep learning models such as GANs, diffusion models, and transformers. A parallel line of work addresses detecting synthetic speech (deepfakes) and mitigating biases in those detection systems. The field is crucial for advancing human-computer interaction, improving accessibility technologies, and combating the malicious use of synthetic audio in fraud and disinformation.

Papers

April 8, 2022

Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech
Jae-Sung Bae, Jinhyeok Yang, Tae-Jun Bak, Young-Sun Joo
Synthesized Speech Prosodic Feature Diverse Set Text to Speech Model Natural Sounding Speech Non Autoregressive Text to Speech

April 7, 2022

DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores
Wei-Cheng Tseng, Wei-Tsung Kao, Hung-yi Lee
Speech Synthesis Synthesized Speech Domain Adaptive Product Distribution Multiple Critic Mean Opinion Score Opinion Distribution

April 6, 2022

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
Shun Lei, Yixuan Zhou, Liyang Chen, Jiankun Hu, Zhiyong Wu, Shiyin Kang, Helen Meng
Synthesized Speech Style Representation Expressive Speech Synthesis Hierarchical Context

April 4, 2022

Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck
Youngsik Eom, Yeonghyeon Lee, Ji Sub Um, Hoirin Kim
Transfer Learning Speaker Verification Voice Conversion Synthesized Speech Anti Spoofing Synthetic Voice Variational Information Bottleneck

April 3, 2022

On incorporating social speaker characteristics in synthetic speech
Sai Sirisha Rallabandi, Sebastian Möller
Speech Synthesis Synthesized Speech Acoustic Feature Speaker Characteristic Female Speaker Vocal Feature

April 1, 2022

Text-To-Speech Data Augmentation for Low Resource Speech Recognition
Rodolfo Zevallos
Data Augmentation Automatic Speech Recognition Text to Speech Synthesized Speech Automatic Speech Recognition Model

March 24, 2022

Does human speech follow Benford's Law?
Leo Hsu, Visar Berisha
Speech Analysis Synthesized Speech Legal Text Human Speech

March 23, 2022

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis
Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng
Knowledge Distillation Synthesized Speech Hierarchical Transformer Encoder Expressive Speech Synthesis Mandarin Speech Hierarchical Context

March 21, 2022

The VoiceMOS Challenge 2022
Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi
Synthesized Speech Mean Opinion Score

March 20, 2022

Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise
Tuomo Raitio, Petko Petkov, Jiangchuan Li, Muhammed Shifas, Andrea Davis, Yannis Stylianou
Text to Speech Industrial Disturbing Noise Synthesized Speech Speech Intelligibility Vocal Expression Neural Text to Speech

March 2, 2022

Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis
Pengyu Cheng, Zhenhua Ling
Speech Synthesis Synthesized Speech Prosodic Feature Speaker Adaptation

February 22, 2022

Improving Cross-lingual Speech Synthesis with Triplet Training Scheme
Jianhao Ye, Hongbin Zhou, Zhiba Su, Wendi He, Kaimeng Ren, Lin Li, Heng Lu
Text to Speech Speech Synthesis Synthesized Speech Multilingual Speech Triplet Learning Cross Lingual Text to Speech

February 13, 2022

Distribution augmentation for low-resource expressive text-to-speech
Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova
Data Augmentation Text to Speech Synthesized Speech Novel Data Augmentation

January 27, 2022

Synthesizing Dysarthric Speech Using Multi-talker TTS for Dysarthric Speech Recognition
Mohammad Soleymanpour, Michael T. Johnson, Rahim Soleymanpour, Jeffrey Berry
Synthesized Speech Dysarthric Speech Multi Speaker Text to Speech Dysarthric Speech Recognition

January 24, 2022

Polyphone disambiguation and accent prediction using pre-trained language models in Japanese TTS front-end
Rem Hida, Masaki Hamada, Chie Kamada, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura
Pre Trained Language Model Text to Speech Synthesized Speech Natural Sounding Speech Accent Recognition Polyphone Disambiguation

January 19, 2022

MHTTS: Fast multi-head text-to-speech for spontaneous speech with imperfect transcription
Dabiao Ma, Yitong Zhang, Meng Li, Feng Ye
End to End Synthesized Speech Multi Speaker Spontaneous Speech Multi Speaker Text to Speech Multi Speaker Tt Transcription Error

December 23, 2021

Multi-speaker Multi-style Text-to-speech Synthesis With Single-speaker Single-style Training Data Scenarios
Qicong Xie, Tao Li, Xinsheng Wang, Zhichao Wang, Lei Xie, Guoqiao Yu, Guanglu Wan
Style Transfer Speech Synthesis Synthesized Speech Expressive Speech Single Speaker

November 15, 2021

Analysis of Data Augmentation Methods for Low-Resource Maltese ASR

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

November 7, 2021

Emotional Prosody Control for Speech Generation
Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi
Synthesized Speech Prosodic Feature Speech Generation Emotion Space Emotion Shift