Zero Shot Speech

Zero-shot speech translation aims to translate speech in one language into text in another without training on any paired speech-translation data for that language pair. Current research focuses on bridging the "modality gap" between speech and text with techniques such as multilingual training, shared embedding spaces (often fixed-size sentence representations), and discrete cross-modal alignment, all of which map speech and text into a common semantic space. These approaches reuse existing large language models and automatic speech recognition data, and in some cases their performance rivals that of supervised methods, which opens the way to more efficient and more broadly applicable speech translation systems.

Papers

February 16, 2024

Pushing the Limits of Zero-shot End-to-End Speech Translation
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà
Continuum Limit, Speech Representation, Multilingual Machine Translation, Speech Encoder, End to End Speech Translation, Modality Gap, Zero Shot Speech

October 5, 2023

Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot
Encoder Side, Multilingual Training, Fixed Size, Zero Shot Speech, Speech to Text Translation

August 22, 2023

SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot
Sentence Embeddings, Language Agnostic, Low Cost, Obstacle Avoidance, Sonar, Zero Shot Speech, Language Aware Encoder

June 22, 2023

AudioPaLM: A Large Language Model That Can Speak and Listen
Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank
Large Language Model, Language Model, Speech to Speech Translation, Speech Language Model, Zero Shot Speech

March 7, 2023

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
Speech Synthesis, Zero Shot Cross Lingual, Codec Language Model, Zero Shot Speech

October 18, 2022

Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation
Chen Wang, Yuchen Liu, Boxing Chen, Jiajun Zhang, Wei Luo, Zhongqiang Huang, Chengqing Zong
Machine Translation, Speech Translation, Cross Modal Alignment, Source Speech, Zero Shot Speech

May 24, 2022

T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation
Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot, Holger Schwenk
Speech Translation, Dynamic Module, Zero Shot Text to Speech, Zero Shot Translation, Modal Translation, Zero Shot Speech