Audio-Visual Speech Enhancement

Audio-visual speech enhancement (AVSE) aims to improve the clarity of speech recordings by combining audio with visual information, typically lip movements, thereby overcoming the limitations of audio-only enhancement in noisy environments. Current research emphasizes robust and efficient models built on deep learning architectures such as transformers, convolutional neural networks (including U-Nets and complex U-Nets), recurrent neural networks (LSTMs), and diffusion models, with particular attention to real-time processing and diverse noise conditions. The field is significant for improving speech recognition accuracy, assistive listening devices such as hearing aids, and human-computer interaction, especially in challenging acoustic scenarios.
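
To make the fusion idea concrete, below is a minimal sketch of a mask-based AVSE model in PyTorch. The module names, feature dimensions, BiLSTM fusion, and the magnitude-mask strategy are illustrative assumptions for exposition, not a reproduction of any specific published system.

```python
# Minimal AVSE sketch (assumptions: magnitude-spectrogram input, precomputed
# lip-movement embeddings aligned to the audio frame rate, mask-based output).
import torch
import torch.nn as nn

class AVSENet(nn.Module):
    def __init__(self, n_freq=257, lip_dim=512, hidden=256):
        super().__init__()
        # Audio branch: encode each noisy spectrogram frame.
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        # Visual branch: encode per-frame lip embeddings (e.g., from a
        # pretrained lip-reading front end -- a hypothetical upstream step).
        self.visual_enc = nn.Sequential(nn.Linear(lip_dim, hidden), nn.ReLU())
        # Fusion: concatenate aligned audio/visual frames and model
        # temporal context with a bidirectional LSTM.
        self.fusion = nn.LSTM(2 * hidden, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Predict a [0, 1] time-frequency mask over the noisy spectrogram.
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, lip_feats):
        # noisy_spec: (batch, frames, n_freq) noisy magnitude spectrogram
        # lip_feats:  (batch, frames, lip_dim) lip-movement embeddings
        a = self.audio_enc(noisy_spec)
        v = self.visual_enc(lip_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        m = self.mask(fused)
        return m * noisy_spec  # enhanced magnitude estimate

# Usage: enhance a 100-frame utterance with random placeholder features.
model = AVSENet()
spec = torch.rand(1, 100, 257)
lips = torch.randn(1, 100, 512)
enhanced = model(spec, lips)  # shape (1, 100, 257)
```

The masking design reflects a common pattern in enhancement models: rather than regressing clean spectrograms directly, the network scales the noisy input, which tends to stabilize training; published AVSE systems vary in whether they operate on magnitudes, complex spectra, or raw waveforms.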

Papers