Non-Streaming
Non-streaming automatic speech recognition (ASR) models process the entire audio input before generating transcriptions, offering superior accuracy compared to their streaming counterparts, which process audio in real time. Current research focuses on closing the performance gap between streaming and non-streaming ASR, employing techniques such as knowledge distillation to transfer knowledge from non-streaming models to streaming ones, as well as contextual biasing and contrastive learning to improve accuracy. A minimal sketch of the distillation idea is given below. These advances aim to improve the accuracy of real-time speech recognition systems while maintaining low latency, benefiting applications such as voice search, virtual assistants, and on-device speech processing.
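The sketch below is a hedged illustration, not taken from any of the papers listed here: it assumes a non-streaming teacher and a streaming student that both emit frame-level logits over the same vocabulary, and combines a standard CTC loss on the ground-truth transcript with a KL-divergence term that pulls the student's posteriors toward the teacher's. The function name `distillation_loss` and the hyperparameters `temperature` and `alpha` are illustrative choices.

```python
# Minimal sketch: knowledge distillation from a non-streaming ASR teacher
# to a streaming student (assumed shapes: logits are (N, T, V)).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      input_lengths, target_lengths,
                      temperature=2.0, alpha=0.5):
    """Blend a supervised CTC loss with a soft-label KL term from the teacher."""
    # Soft targets from the full-context (non-streaming) teacher.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd_term = F.kl_div(student_log_probs, teacher_probs,
                       reduction="batchmean") * temperature ** 2

    # Standard supervised CTC loss on the streaming student's outputs.
    log_probs = F.log_softmax(student_logits, dim=-1).transpose(0, 1)  # (T, N, V)
    ctc_term = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                          blank=0, zero_infinity=True)

    return alpha * kd_term + (1 - alpha) * ctc_term
```

In practice the teacher runs offline over the full utterance while the student sees only limited right context, so the KL term transfers the teacher's full-context knowledge into the low-latency model; the weighting between the two terms is a tunable design choice.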