Speech Synthesis
Speech synthesis generates human-like speech from text or other inputs, with research focused on improving naturalness, expressiveness, and efficiency. Current work centers on model architectures such as diffusion models, generative adversarial networks (GANs), and large language models (LLMs), often combined with techniques such as low-rank adaptation (LoRA) for parameter-efficient fine-tuning and with finer control over attributes like emotion and prosody. These advances matter for applications ranging from assistive technologies for the visually impaired to realistic virtual avatars and improved accessibility for under-resourced languages.
Papers
VoicePrivacy 2022 System Description: Speaker Anonymization with Feature-matched F0 Trajectories
Ünal Ege Gaznepoglu, Anna Leschanowsky, Nils Peters
Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation
Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Georgia Maniati, Panos Kakoulidis, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis
Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments
Anusha Prakash, S Umesh, Hema A Murthy
NNSVS: A Neural Network-Based Singing Voice Synthesis Toolkit
Ryuichi Yamamoto, Reo Yoneyama, Tomoki Toda
Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis
Yuma Shirahata, Ryuichi Yamamoto, Eunwoo Song, Ryo Terashima, Jae-Min Kim, Kentaro Tachibana