Paper ID: 2410.03719

FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency

Rui Liu, Jiatian Xi, Ziyue Jiang, Haizhou Li

Text-based speech editing (TSE) allows users to modify speech by editing the corresponding text and performing operations such as cutting, copying, and pasting to generate updated audio without altering the original recording directly. While current TSE techniques focus on minimizing discrepancies between generated speech and reference targets within edited segments, they often neglect the importance of maintaining both local and global fluency in the context of the original discourse. Additionally, seamlessly integrating edited segments with unaltered portions of the audio remains challenging, typically requiring support from text-to-speech (TTS) systems. This paper introduces a novel approach, FluentEditor+, designed to overcome these limitations. FluentEditor+ employs advanced feature extraction techniques to capture both acoustic and prosodic characteristics, ensuring fluent transitions between edited and unedited regions. The model enforces segmental acoustic smoothness and global prosody consistency, allowing seamless splicing of speech while preserving the coherence and naturalness of the output. Extensive experiments on the VCTK and LibriTTS datasets show that FluentEditor+ surpasses existing TTS-based methods, including EditSpeech, CampNet, $A^3T$, FluentSpeech, and FluentEditor, in both fluency and prosody. Ablation studies further highlight the contribution of each module to the overall effectiveness of the system.

Submitted: Sep 28, 2024