Synthesized Speech

Synthesized speech research aims to produce realistic, natural-sounding artificial speech for applications such as voice assistants, audiobooks, and accessibility tools. Current work focuses on improving naturalness and expressiveness, often with deep learning models such as GANs, diffusion models, and transformers. A parallel line of work addresses detecting synthetic speech (deepfakes) and mitigating biases in those detection systems. The field matters for advancing human-computer interaction, improving accessibility technologies, and combating the malicious use of synthetic audio in fraud and disinformation.

Papers

September 14, 2024

The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech
Kaito Baba, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari
Transfer Learning Image Classification Synthesized Speech Speech Supervised Learning Model Naturalness Assessment

September 11, 2024

D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack
Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Adversarial Attack Study Feature Synthesized Speech Tiny Refinement Elicit Resilience Deepfake Detector Fake Audio CAPtcha Solver Imperceptible Adversarial

August 27, 2024

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech
Haowei Lou, Helen Paik, Wen Hu, Lina Yao
Synthesized Speech High Quality Speech

August 25, 2024

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
Synthesized Speech Diffusion Transformer

August 23, 2024

Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples
Zhenyu Wang, John H. L. Hansen
Adversarial Example Speech Recognition Meta Learning Speaker Verification Synthesized Speech Audio Anti Spoofing Disentanglement Framework

August 13, 2024

VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders
Yubing Cao, Yongming Li, Liejun Wang, Yinfeng Yu
Speech Synthesis Generative Adversarial Synthesized Speech Modern Vocoders Multi Layer Neural Network High Fidelity Speech

July 26, 2024

Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models
Neil Shah, Shirish Karande, Vineet Gandhi
Ground Truth Speech Synthesis Synthesized Speech Self Supervised Speech Model

July 23, 2024

Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments
Pai Zhu, Dhruuv Agarwal, Jacob W. Bartel, Kurt Partridge, Hyun Jin Park, Quan Wang
Text to Speech Low Resource Synthesized Speech Keyword Spotting Google Speech Command Unlabeled Speech

July 17, 2024

TTSDS -- Text-to-Speech Distribution Score
Christoph Minixhofer, Ondřej Klejch, Peter Bell
Text to Speech Synthesized Speech

July 5, 2024

We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings
Ismail Rasim Ulgen, Carlos Busso, John H. L. Hansen, Berrak Sisman
Speech Synthesis Synthesized Speech Speaker Embeddings Category Wise Variation Personalized Speech Sub Center

June 25, 2024

Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection
Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng
Synthesized Speech Multi Head Self Attention Channel Wise Synthetic Speech Detection Synthetic Speech Detector

June 22, 2024

TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers
Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Guanrou Yang, Xie Chen
Text Modality Synthesized Speech Zero Shot Text to Speech Codec Language Model Speech Synthesizer

June 13, 2024

June 11, 2024

June 7, 2024

TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking
Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen
Text to Speech Synthesized Speech Speech to Text Agnostic Watermarking Imperceptible Watermark

June 5, 2024

Style Mixture of Experts for Expressive Text-To-Speech Synthesis
Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman
Style Transfer Expert Knowledge Synthesized Speech Style Consistency Expressive Speech Synthesis Style Encoder

June 2, 2024

Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang
Zero Shot Human Feedback Synthesized Speech Speech Quality Speech Perception Emotional Text to Speech

ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

AudioMarkBench: Benchmarking Robustness of Audio Watermarking