Paper ID: 2410.18850

We Augmented Whisper With kNN and You Won't Believe What Came Next

Maya K. Nachesa, Vlad Niculae

Speech recognition performance varies by language, domain, and speaker characteristics such as accent, and fine-tuning a model on any of these categories may lead to catastrophic forgetting. $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that can instead adapt by building an external datastore that can then be searched during inference time, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.

Submitted: Oct 24, 2024

Topics

Machine Translation
End to End
Speaker Adaptation
Neural Decoder

Links

arXiv PDF