Unsupervised Automatic Speech Recognition

Unsupervised automatic speech recognition (ASR) aims to build speech recognition systems without paired speech and text data, a crucial step towards enabling ASR for low-resource languages. Current research focuses on model architectures and training methods, often combining self-supervised learning, reinforcement learning, and adversarial training, that learn the mapping between speech and text from unpaired corpora. These approaches use techniques such as masked token infilling, boundary segmentation, and cross-lingual pseudo-labeling to improve accuracy and robustness, and have driven progress in unsupervised speech-to-text and even speech-to-speech tasks. The ultimate goal is to make ASR technology accessible and applicable across diverse languages and domains.

Papers