Text to Speech
Text-to-speech (TTS) research aims to synthesize natural-sounding human speech from textual input, focusing on improving speech quality, speaker similarity, and efficiency. Current efforts concentrate on developing advanced architectures like diffusion models and transformers, often incorporating techniques such as flow matching and semantic communication to enhance both the naturalness and expressiveness of generated speech. This field is crucial for applications ranging from assistive technologies and accessibility tools to combating deepfakes and creating more realistic synthetic datasets for training other AI models.
Papers
LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion
Chunfeng Wang, Peisong Huang, Yuxiang Zou, Haoyu Zhang, Shichao Liu, Xiang Yin, Zejun Ma
Leveraging Large Text Corpora for End-to-End Speech Summarization
Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura