Speech Synthesis
Speech synthesis aims to generate human-like speech from text or other inputs, with current work focused on improving naturalness, expressiveness, and efficiency. Research emphasizes advances in model architectures such as diffusion models, generative adversarial networks (GANs), and large language models (LLMs), often incorporating techniques like low-rank adaptation (LoRA) for parameter efficiency, along with finer control over emotion and prosody. These improvements matter for applications ranging from assistive technologies for the visually impaired to realistic virtual avatars and better accessibility for under-resourced languages.
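To make the parameter-efficiency point concrete, below is a minimal sketch of how low-rank adaptation (LoRA), mentioned in the summary above, can be attached to a single linear layer of a synthesis model. The module name, layer sizes, and hyperparameters are illustrative assumptions, not details from any of the listed papers.

```python
# Illustrative LoRA sketch: freeze a pretrained linear layer and learn only a
# low-rank update W + (alpha/r) * B @ A. All names and sizes are assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Low-rank factors: A maps input -> rank r, B maps rank r -> output.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Usage: wrap a hypothetical projection inside a TTS backbone.
proj = nn.Linear(512, 512)
adapted = LoRALinear(proj, r=8, alpha=16.0)
out = adapted(torch.randn(4, 100, 512))  # (batch, frames, hidden)
```

Only the two small factor matrices are trained, which is why LoRA-style adaptation keeps the number of updated parameters a small fraction of the full model.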
Papers
FleSpeech: Flexibly Controllable Speech Generation with Various Prompts
Hanzhao Li, Yuke Li, Xinsheng Wang, Jingbin Hu, Qicong Xie, Shan Yang, Lei Xie
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang
DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions
Weidong Chen, Shan Yang, Guangzhi Li, Xixin Wu