Open Whisper-style Speech Models
Open Whisper-style Speech Models (OWSMs) aim to provide open-source, high-performance speech-to-text systems that replicate the capabilities of closed models such as OpenAI's Whisper. Current research focuses on improving accuracy and efficiency through refined tokenization, data filtering, the integration of visual cues (e.g., lip reading), and encoder-only architectures trained with Connectionist Temporal Classification (CTC). These advances improve transcription accuracy and robustness to noise and overlapping speakers, and they enable new applications such as audio-visual speech recognition and keyword-guided transcription, with impact ranging from accessibility technologies to human-robot interaction.
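To illustrate why CTC enables fast encoder-only inference, here is a minimal sketch of greedy CTC decoding: the encoder emits one label per frame, and decoding simply collapses repeated labels and removes the blank symbol, with no autoregressive decoder pass. The token IDs and blank index below are illustrative, not tied to any particular OWSM vocabulary.

```python
# Greedy CTC decoding sketch: collapse repeated per-frame labels, drop blanks.
# BLANK = 0 follows a common convention; the real index depends on the vocabulary.
BLANK = 0

def ctc_greedy_decode(frame_ids):
    """Collapse a sequence of per-frame argmax token IDs into output labels."""
    out = []
    prev = None
    for t in frame_ids:
        # Emit a token only when it differs from the previous frame's label
        # and is not the blank symbol.
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# Hypothetical frames spelling "hello": note the blank between the two 12s
# ('l'), which is what lets CTC represent repeated characters.
frames = [0, 8, 8, 0, 5, 5, 5, 0, 0, 12, 12, 0, 12, 15]
print(ctc_greedy_decode(frames))  # -> [8, 5, 12, 12, 15]
```

Because every frame is processed independently of the output history, this decoding step is trivially parallelizable, which is one reason encoder-only CTC models can be much faster at inference than attention-based encoder-decoder models.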
Papers
Connecting Speech Encoder and Large Language Model for ASR
Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe
Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person
Lucas Rafael Stefanel Gris, Ricardo Marcacini, Arnaldo Candido Junior, Edresson Casanova, Anderson Soares, Sandra Maria Aluísio
On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications
Vamsikrishna Chemudupati, Marzieh Tahaei, Heitor Guimaraes, Arthur Pimentel, Anderson Avila, Mehdi Rezagholizadeh, Boxing Chen, Tiago Falk