Speech Text

Speech-text research develops models that bridge spoken and written language, with the goal of better understanding and generation in both modalities. Current work concentrates on joint pre-training of speech and text with encoder-decoder architectures and multi-task learning, often adding self-supervised tasks that exploit unlabeled data and improve cross-modal alignment. These methods are applied to automatic speech recognition, speech translation, and text-to-speech synthesis, where they are reported to yield more accurate and more natural-sounding systems, including in low-resource settings.

Papers