Audio-Visual Speech Representation

Audio-visual speech representation focuses on computational models that integrate the audio and visual streams of speech, with the aim of improving tasks such as speech recognition and lip reading. Current research emphasizes self-supervised learning, often with transformer-based architectures such as HuBERT variants (e.g., AV-HuBERT), to learn robust representations from large unlabeled datasets; it also incorporates techniques such as viseme analysis and contextual modeling via LLMs to improve accuracy. These advances hold significant promise for human-computer interaction, accessibility technologies for the hearing impaired, and robust speech processing in noisy environments.
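
To make the self-supervised objective concrete, the sketch below shows a minimal AV-HuBERT-style setup in PyTorch: per-frame audio and visual features are projected, fused, partially masked, passed through a transformer encoder, and trained to predict discrete cluster IDs at the masked frames. This is an illustrative assumption rather than the published architecture; the feature dimensions, the concatenation-based fusion, and the random pseudo-labels are placeholders (real systems use ResNet lip-reading front-ends, modality dropout, and k-means targets computed offline over the corpus).

```python
import torch
import torch.nn as nn

class AVSpeechEncoder(nn.Module):
    """Minimal AV-HuBERT-style encoder (illustrative sketch): fuse
    per-frame audio and visual features, then predict discrete cluster
    targets at masked positions."""

    def __init__(self, audio_dim=104, visual_dim=512, d_model=768,
                 n_heads=12, n_layers=12, n_clusters=500):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)   # simple concat fusion
        self.mask_emb = nn.Parameter(torch.zeros(d_model))  # learned [MASK]
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_clusters)

    def forward(self, audio_feats, visual_feats, mask):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
        # mask: (B, T) bool, True where frames are masked for prediction
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        x = self.fuse(torch.cat([a, v], dim=-1))
        # Replace masked frames with the learned mask embedding
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.encoder(x))   # (B, T, n_clusters) logits

# Self-supervised objective: cross-entropy against cluster pseudo-labels,
# evaluated only at the masked frames (random labels here for illustration).
model = AVSpeechEncoder()
B, T = 2, 50
audio = torch.randn(B, T, 104)
video = torch.randn(B, T, 512)
mask = torch.rand(B, T) < 0.3
targets = torch.randint(0, 500, (B, T))
logits = model(audio, video, mask)
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
```

Because the targets are shared across modalities, the encoder is pushed to align lip movements with the acoustic clusters they co-occur with, which is one reason such representations transfer to both lip reading and noise-robust recognition.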

Papers