End-to-End Speech-to-Text Translation

End-to-end speech-to-text translation aims to convert speech in one language directly into written text in another, without a separate pipeline of speech recognition followed by machine translation. Current research focuses on improving model architectures, such as Transformer-based networks and models trained with connectionist temporal classification (CTC), and often employs multi-task learning, consistency regularization, and data augmentation to bridge the modality gap between speech and text and to address data scarcity. These advances could improve cross-lingual communication in applications such as real-time interpretation, automated subtitling, and accessibility tools.
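
As a rough illustration of how multi-task learning and consistency regularization can be combined, the sketch below adds a speech-translation loss, a text-translation loss, and a symmetric KL term that pulls the two predictive distributions together. It is a toy example for a single target token, not the method of any particular paper: the function names, the `consistency_weight` parameter, and the choice of symmetric KL as the regularizer are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the reference token."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); zero when p == q."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def multitask_loss(speech_logits, text_logits, target_index,
                   consistency_weight=1.0):
    """Hypothetical joint objective for one target token.

    speech_logits: decoder scores when the input is speech
    text_logits:   decoder scores when the input is the source transcript
    """
    p_speech = softmax(speech_logits)
    p_text = softmax(text_logits)
    st_loss = cross_entropy(p_speech, target_index)   # speech translation
    mt_loss = cross_entropy(p_text, target_index)     # text translation
    consistency = symmetric_kl(p_speech, p_text)      # modality-gap penalty
    return st_loss + mt_loss + consistency_weight * consistency


if __name__ == "__main__":
    # Toy three-word vocabulary; the reference token has index 0.
    print(multitask_loss([2.0, 0.5, 0.1], [1.5, 0.7, 0.2], target_index=0))
```

In this sketch the consistency term vanishes when both input modalities yield the same distribution, so only a mismatch between the speech and text branches is penalized. A real system would sum such terms over all target positions and batches, typically inside an automatic-differentiation framework.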

Papers