Open Whisper-style Speech Model

Open Whisper-style Speech Models (OWSMs) aim to provide open-source, high-performance speech-to-text systems that reproduce the capabilities of models such as OpenAI's Whisper, whose training data and training code are not public. Current research focuses on improving accuracy and efficiency through refined tokenization, data filtering, integration of visual cues (e.g., lip reading), and encoder-only architectures trained with Connectionist Temporal Classification (CTC). These advances improve transcription accuracy and robustness to noise and multiple speakers, and they enable new applications such as audio-visual speech recognition and keyword-guided transcription, with impact in fields ranging from accessibility technologies to human-robot interaction.
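
As a rough illustration of the CTC idea mentioned above, the following is a minimal sketch of greedy CTC decoding: take the most likely label in each audio frame, merge consecutive repeats, then drop the blank symbol. The function name, the toy vocabulary, and the blank index of 0 are assumptions made for this example and are not taken from any OWSM implementation; real systems typically operate on subword tokens and may use beam search instead.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding over per-frame score vectors.

    frame_scores: list of per-frame score lists, one score per vocab entry.
    vocab: list mapping label index -> token string (hypothetical toy vocab).
    """
    # Step 1: pick the highest-scoring label in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # Step 2: merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best)]
    # Step 3: remove blanks, which only separate genuine repeats.
    return [vocab[label] for label in collapsed if label != blank]


# Toy example: five frames over a three-symbol vocabulary.
vocab = ["<blank>", "h", "i"]
scores = [
    [0.1, 0.8, 0.1],    # "h"
    [0.1, 0.7, 0.2],    # "h" again (merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # "i"
    [0.2, 0.1, 0.7],    # "i" again (merged)
]
decoded = ctc_greedy_decode(scores, vocab)
print(decoded)
```

Because every frame is labelled independently, an encoder-only model can emit all frame scores in one pass without an autoregressive decoder, which is one reason such models are attractive for efficiency.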

Papers