Paper ID: 2410.14709
A two-stage transliteration approach to improve performance of a multilingual ASR
Rohit Kumar
End-to-end Automatic Speech Recognition (ASR) systems are rapidly becoming the state of the art, overtaking other modeling methods. Several techniques have been introduced to improve their ability to handle multiple languages. However, because writing scripts vary across languages, acoustically similar units are not always mapped to an appropriate grapheme in the target language during decoding. This restricts the scalability and adaptability of the model when dealing with multiple languages in code-mixing scenarios. This paper presents an approach to building a language-agnostic end-to-end model trained on a grapheme set obtained by projecting the multilingual grapheme data to the script of a more generic target language. This approach spares the acoustic model from being retrained to span a larger space and can easily be extended to multiple languages. A two-stage transliteration process realizes this approach and is shown to minimize speech-class confusion. We performed experiments with an end-to-end multilingual speech recognition system for two Indic languages, namely Nepali and Telugu. The original grapheme space of these languages is projected to the Devanagari script. We achieved a relative reduction of 20% in Word Error Rate (WER) and 24% in Character Error Rate (CER) in the transliterated space, compared with other language-dependent modeling methods.
Submitted: Oct 9, 2024
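
The abstract does not spell out the two stages of the transliteration process, so the following is only an illustrative sketch of the general idea of projecting one script's graphemes onto Devanagari, not the authors' method. It assumes a simple code-point offset between the Telugu (U+0C00 to U+0C7F) and Devanagari (U+0900 to U+097F) Unicode blocks, which share a largely parallel layout; the function name `project_to_devanagari` is hypothetical, and the mapping is approximate because not every position in the two blocks holds an equivalent character.

```python
# Illustrative sketch only: offset-based projection of Telugu code points
# onto the Devanagari block. This is an assumption for demonstration, not
# the two-stage transliteration described in the paper.

TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F  # Unicode Telugu block
DEVANAGARI_START = 0x0900                  # Unicode Devanagari block


def project_to_devanagari(text: str) -> str:
    """Shift Telugu code points to the same offset in the Devanagari block.

    Characters outside the Telugu block (including text that is already in
    Devanagari, such as Nepali) are passed through unchanged.
    """
    projected = []
    for ch in text:
        cp = ord(ch)
        if TELUGU_START <= cp <= TELUGU_END:
            projected.append(chr(cp - TELUGU_START + DEVANAGARI_START))
        else:
            projected.append(ch)
    return "".join(projected)


if __name__ == "__main__":
    # Telugu letter KA (U+0C15) lands on Devanagari letter KA (U+0915).
    print(project_to_devanagari("\u0c15"))
```

With a shared target script like this, transcripts from both languages can be drawn from a single grapheme inventory, which is the property the abstract relies on; a real system would need an explicit mapping table for the positions where the two blocks diverge.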