High-Quality Speech
High-quality speech synthesis research aims to generate natural-sounding, human-like speech from various inputs, including text, articulatory data, and even silent video. Current efforts focus on improving model efficiency and robustness using architectures such as neural codec language models, diffusion models, and transformer-based networks, often incorporating techniques like parameter-efficient fine-tuning and multi-task learning to enhance both speed and quality. These advances have significant implications for applications such as audiobook production, virtual assistants, accessibility tools for people with visual impairments, and speech processing in noisy or otherwise challenging environments. The field is also actively addressing emotional expression, speaker personalization, and multilingual capability.
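To make the parameter-efficient fine-tuning mentioned above concrete, the sketch below illustrates the core idea behind a LoRA-style low-rank adapter: the large pretrained weight matrix stays frozen, and only two small factor matrices are trained. This is a minimal illustration in NumPy, not the implementation from any of the papers listed here; all dimensions and names are assumptions.

```python
import numpy as np

# Illustrative sketch of parameter-efficient fine-tuning via a
# low-rank (LoRA-style) update. The pretrained weight W_frozen is
# never updated; only the small factors A and B are trained.
rng = np.random.default_rng(0)

d_out, d_in, rank = 256, 256, 4                 # rank << d_in
W_frozen = rng.standard_normal((d_out, d_in))   # frozen pretrained weight

A = np.zeros((d_out, rank))                     # zero-init: the adapter
B = rng.standard_normal((rank, d_in))           # starts as a no-op

def adapted_forward(x, alpha=8.0):
    """Frozen projection plus the scaled low-rank correction."""
    return W_frozen @ x + (alpha / rank) * (A @ (B @ x))

x = rng.standard_normal(d_in)

# Fraction of parameters that are actually trainable:
fraction = (A.size + B.size) / W_frozen.size
print(fraction)  # 2048 / 65536 = 0.03125
```

Because A is initialised to zero, fine-tuning begins exactly at the pretrained model's behaviour, and only about 3% of the layer's parameters need gradients, which is what makes the approach attractive for adapting large speech models cheaply.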
Papers
A methodological framework and exemplar protocol for the collection and analysis of repeated speech samples
Nicholas Cummins, Lauren L. White, Zahia Rahman, Catriona Lucas, Tian Pan, Ewan Carr, Faith Matcham, Johnny Downs, Richard J. Dobson, Thomas F. Quatieri, Judith Dineley
SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark
Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari