Paper ID: 2309.17267

Wiki-En-ASR-Adapt: Large-scale synthetic dataset for English ASR Customization

Alexandra Antonova

We present a first large-scale public synthetic dataset for contextual spellchecking customization of automatic speech recognition (ASR), with a focus on diverse rare and out-of-vocabulary (OOV) phrases such as proper names or terms. The proposed approach allows creating millions of realistic examples of corrupted ASR hypotheses and simulating non-trivial biasing lists for the customization task. Furthermore, we propose injecting two types of "hard negatives" into the simulated biasing lists in training examples and describe our procedures for mining them automatically. We report experiments with training an open-source customization model on the proposed dataset and show that injecting hard negative biasing phrases decreases both WER and the number of false alarms.
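To make the idea of a simulated biasing list with hard negatives concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's procedure: the function names, the toy vocabulary, and the use of character-level similarity (`difflib`) to pick hard negatives are all hypothetical, and the abstract does not specify how the two types of hard negatives are actually mined.

```python
import difflib
import random


def mine_hard_negatives(target, vocabulary, k=2, cutoff=0.6):
    """Return up to k phrases that resemble `target` but are not it.

    Assumption: plain string similarity stands in for the paper's
    (unspecified here) hard-negative mining procedures.
    """
    candidates = [p for p in vocabulary if p != target]
    return difflib.get_close_matches(target, candidates, n=k, cutoff=cutoff)


def build_biasing_list(target, vocabulary, size=6, k_hard=2, seed=0):
    """Simulate a biasing list: the target, hard negatives, random fillers."""
    rng = random.Random(seed)
    hard = mine_hard_negatives(target, vocabulary, k=k_hard)
    used = {target, *hard}
    pool = [p for p in vocabulary if p not in used]
    # Fill the remaining slots with random ("easy") negatives.
    n_random = max(0, min(len(pool), size - 1 - len(hard)))
    easy = rng.sample(pool, n_random)
    biasing = [target, *hard, *easy]
    rng.shuffle(biasing)  # the target's position should carry no signal
    return biasing, hard


# Toy vocabulary (hypothetical; not taken from the dataset).
VOCAB = ["jasper", "jaspers", "casper", "quartznet",
         "wikipedia", "kaldi", "tacotron", "librispeech"]

biasing_list, hard_negatives = build_biasing_list("jasper", VOCAB)
print(biasing_list)
print(hard_negatives)
```

In this sketch the look-alike phrases "jaspers" and "casper" sit next to the correct phrase "jasper" in the list, so a customization model trained on such examples has to discriminate between near matches rather than simply picking the only plausible candidate.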

Submitted: Sep 29, 2023

Topics

Automatic Speech Recognition
Speech Recognition
Spelling Correction
Automatic Speech Recognition Hypothesis
Large Scale Synthetic Dataset

Links

arXiv PDF