Zero-Shot Speech Translation

Zero-shot speech translation aims to translate speech in one language into text in another without any paired speech-text training data for that specific language pair. Current research focuses on bridging the "modality gap" between speech and text with techniques such as multilingual training, shared embedding spaces (often fixed-size representations), and discrete cross-modal alignment, all of which map speech and text into a common semantic space. These approaches leverage existing large language models and automatic speech recognition data to reach strong performance, in some cases rivaling supervised methods, and open the door to more efficient and broadly applicable speech translation systems.
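
The sketch below illustrates the shared-embedding-space idea in a minimal, generic form: a speech encoder and a text encoder project their inputs into one fixed-size space, and a CLIP-style contrastive loss pulls matched speech-transcript pairs together. It is not the method of any particular paper; the module names, feature dimensions, and pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceAligner(nn.Module):
    """Projects speech and text features into one fixed-size shared embedding space."""
    def __init__(self, speech_dim=80, text_dim=512, shared_dim=256):
        super().__init__()
        self.speech_proj = nn.Sequential(nn.Linear(speech_dim, shared_dim), nn.ReLU(),
                                         nn.Linear(shared_dim, shared_dim))
        self.text_proj = nn.Sequential(nn.Linear(text_dim, shared_dim), nn.ReLU(),
                                       nn.Linear(shared_dim, shared_dim))

    def forward(self, speech_feats, text_feats):
        # Mean-pool variable-length sequences into fixed-size vectors,
        # then L2-normalize so dot products become cosine similarities.
        s = F.normalize(self.speech_proj(speech_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(text_feats.mean(dim=1)), dim=-1)
        return s, t

def contrastive_alignment_loss(s, t, temperature=0.07):
    # Symmetric InfoNCE: matched speech/text pairs sit on the diagonal
    # of the similarity matrix and are treated as the correct class.
    logits = s @ t.T / temperature
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy batch: 4 utterances (filterbank frames) paired with 4 transcripts (token embeddings).
speech = torch.randn(4, 120, 80)   # (batch, frames, filterbank dims)
text = torch.randn(4, 20, 512)     # (batch, tokens, embedding dims)
model = SharedSpaceAligner()
loss = contrastive_alignment_loss(*model(speech, text))
loss.backward()
print(f"alignment loss: {loss.item():.3f}")
```

Once speech and text land in the same space, a text-side decoder trained only on text (e.g., from ASR or MT data) can, in principle, consume speech embeddings at test time, which is what enables the zero-shot transfer described above.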

Papers