Multilingual Audio-Visual Speech Recognition

Multilingual audio-visual speech recognition (MAVSR) aims to build robust speech recognition systems that combine audio and visual information across multiple languages. Current research develops models, often built on neural architectures such as Conformers, that use techniques such as cross-lingual pre-training and hybrid CTC/RNN-T approaches to improve accuracy and noise robustness across diverse languages. Large-scale multilingual datasets such as MuAViC are driving progress by making it possible to train a single model that handles multiple languages, which reduces the need for language-specific training. The field could make speech technology more accessible and advance research in related areas such as speech-to-text translation and human-computer interaction.

Papers