Text-Based Speech Editing

Text-based speech editing aims to modify an audio recording by editing its text transcript, offering a more intuitive and efficient alternative to manual waveform manipulation. Current research focuses on improving the naturalness and fluency of the edited speech, often using neural network architectures such as transformers and diffusion models, together with techniques such as context-aware prosody correction and semantic enrichment to improve intelligibility and consistency. The field matters for audio and video production, where it could enable faster and more precise editing, and for accessibility technologies that support people with speech impairments.

Papers