Synthesized Speech

Synthesized speech research aims to create realistic, natural-sounding artificial speech for applications such as voice assistants, audiobooks, and accessibility tools. Current work concentrates on improving naturalness and expressiveness, often with deep learning models such as GANs, diffusion models, and transformers. A parallel line of work addresses the detection of synthetic speech (audio deepfakes) and the mitigation of biases in those detection systems. The field matters for advancing human-computer interaction, improving accessibility technologies, and combating the malicious use of synthetic audio in fraud and disinformation.

Papers

March 31, 2024

Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation
Rohan Chaudhury, Mihir Godbole, Aakash Garg, Jinsil Hwaryoung Seo
Speech Synthesis Synthesized Speech Conversational System Human Communication Speech Pattern Zero Shot Emotion Disfluency Generation

March 9, 2024

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen
Full Model Text to Speech Synthesized Speech Text to Speech Model Zero Shot Text to Speech Zero Shot Voice Conversion Data Scaling

March 5, 2024

AttentionStitch: How Attention Solves the Speech Editing Problem
Antonios Alexos, Pierre Baldi
Human Attention Synthesized Speech Speech Generation High Quality Speech Pay Attention User Utterance Speech Editing

February 22, 2024

Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer
Amit Kumar Singh Yadav, Ziyue Xiang, Kratika Bhagtani, Paolo Bestagini, Stefano Tubaro, Edward J. Delp
Synthesized Speech Audio Spectrogram Transformer Synthetic Speech Detection Synthetic Speech Detector

February 19, 2024

On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models
Miri Varshavsky-Hassid, Roy Hirsch, Regev Cohen, Tomer Golany, Daniel Freedman, Ehud Rivlin
Latent Space Text to Speech Synthesized Speech Denoising Diffusion Model Diffusion Based Text

February 8, 2024

Listening Between the Lines: Synthetic Speech Detection Disregarding Verbal Content
Davide Salvi, Temesgen Semu Balcha, Paolo Bestagini, Stefano Tubaro
Synthesized Speech Best Fit Line Synthetic Speech Detection Verbal Communication Synthetic Speech Detector Audio Forensics

January 22, 2024

Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis
Vinotha R, Hepsiba D, L. D. Vijay Anand, Deepak John Reji
Text to Speech Speech Synthesis Timely Communication Synthesized Speech Speech Technology Voice Cloning

January 14, 2024

ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering
Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen
Zero Shot Synthesized Speech Zero Shot Text to Speech Codec Language Model Audio Token

January 8, 2024

Creating Personalized Synthetic Voices from Articulation Impaired Speech Using Augmented Reconstruction Loss
Yusheng Tian, Jingyu Li, Tan Lee
Synthesized Speech Generated Text Synthetic Voice Reconstruction Loss Personalized Speech Speech Sound Disorder Dysarthric Speech Reconstruction

January 5, 2024

Pheme: Efficient and Conversational Speech Generation
Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulić
High Efficiency Text to Speech Synthesized Speech Speech Generation Non Autoregressive Text to Speech Conversation Generation

December 21, 2023

Style Modeling for Multi-Speaker Articulation-to-Speech
Miseul Kim, Zhenyu Piao, Jihyun Lee, Hong-Goo Kang
Synthesized Speech High Quality Speech Articulatory Signal Neural 3D Articulation

December 19, 2023

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis
Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu, Helen Meng
Synthesized Speech Self Supervised Task Style Encoder Audiobook Speech Synthesis

December 16, 2023

CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis
Yayue Deng, Jinlong Xue, Yukang Jia, Qifei Li, Yichen Han, Fengping Wang, Yingming Gao, Dengfeng Ke, Ya Li
Contrastive Learning Synthesized Speech Prosodic Feature Contextual Understanding Conversational Speech Synthesis

November 13, 2023

Parrot-Trained Adversarial Examples: Pushing the Practicality of Black-Box Audio Attacks against Speaker Recognition Models
Rui Duan, Zhe Qu, Leah Ding, Yao Liu, Zhuo Lu
Adversarial Example Synthesized Speech Practical Application Speaker Recognition Model

October 10, 2023

AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion
Haeyun Choi, Jio Gim, Yuho Lee, Youngin Kim, Young-Joo Suh
Synthesized Speech Speech Encoder Zero Shot Voice Conversion Speaker Independent Lingual Voice Conversion

October 6, 2023

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning
Tao Li, Zhichao Wang, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie
Zero Shot U Net Synthesized Speech Voice Cloning Shot Voice Cloning

October 1, 2023

Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech
Dareen Alharthi, Roshan Sharma, Hira Dhamyal, Soumi Maiti, Bhiksha Raj, Rita Singh
Speech Synthesis Synthesized Speech Speech Intelligibility

September 29, 2023

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features

September 19, 2023

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, Xie Chen
Speech Synthesis Speech Emotion Recognition Synthesized Speech Pre Trained Speech Model Emotional Text to Speech Synthetic Emotional Speech