Audio-Visual Speech Recognition

Audio-visual speech recognition (AVSR) aims to improve the accuracy and robustness of automatic speech recognition by incorporating visual information, such as lip movements, to complement the audio signal. Current research emphasizes models that generalize across diverse video conditions, employing techniques such as mixture-of-experts routing, large language models, and efficient architectures like Conformers and Transformers, and sometimes using self-supervised learning to mitigate data scarcity. These advances matter for speech recognition in noisy environments and for applications requiring multimodal understanding, such as virtual assistants and accessibility technologies.
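The core idea of combining audio and visual streams can be illustrated with a minimal late-fusion sketch: per-frame audio and lip-region features are concatenated and passed through a linear classifier. All dimensions and names here are illustrative assumptions, not taken from any specific AVSR system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes (assumptions for illustration):
# 80 log-mel audio bins, a 64-dim lip-ROI embedding, 40 output classes.
AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 80, 64, 40

def fuse_and_classify(audio_feat, visual_feat, W, b):
    """Late fusion: concatenate the two modality streams frame-by-frame,
    then apply a linear layer and a softmax to get per-frame posteriors."""
    fused = np.concatenate([audio_feat, visual_feat], axis=-1)  # (T, AUDIO_DIM + VISUAL_DIM)
    logits = fused @ W + b                                      # (T, NUM_CLASSES)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))     # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

T = 10  # number of synchronized frames
audio = rng.standard_normal((T, AUDIO_DIM))
visual = rng.standard_normal((T, VISUAL_DIM))
W = rng.standard_normal((AUDIO_DIM + VISUAL_DIM, NUM_CLASSES)) * 0.01
b = np.zeros(NUM_CLASSES)

posteriors = fuse_and_classify(audio, visual, W, b)
print(posteriors.shape)                            # (10, 40)
print(bool(np.allclose(posteriors.sum(axis=-1), 1.0)))  # True
```

Real systems replace the linear layer with Conformer or Transformer encoders per modality and fuse with attention, but the concatenate-then-classify pattern conveys why a clean visual stream can compensate for a noisy audio stream.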

Papers