Audio-Visual Speaker Extraction

Audio-visual speaker extraction isolates a target speaker's voice from a mixture of sounds using both audio and video input, which improves on audio-only methods. Current research focuses on robustness to variations in speaker pose and to intermittent visual cues. Techniques include pose-invariant networks and visual embedding inpainting, used within time-domain models that combine visual cues with contextual cues such as phonetic sequences. These advances matter for speech recognition in noisy environments and for more natural, robust human-computer interaction, particularly in applications such as video conferencing and assistive listening devices.
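To make the problem of intermittent visual cues concrete, here is a minimal Python sketch of filling in missing frames of a visual embedding sequence. It is not the learned inpainting used in the research summarized above: it substitutes plain linear interpolation between the nearest observed frames, purely as an illustration. The function name `inpaint_visual_embeddings`, the use of `None` to mark a missing frame, and the list-of-floats embedding format are all assumptions made for this example.

```python
from typing import List, Optional

Embedding = List[float]


def inpaint_visual_embeddings(frames: List[Optional[Embedding]]) -> List[Embedding]:
    """Fill missing frames (None) in a sequence of visual embeddings.

    Assumption: linear interpolation stands in for a learned inpainting model.
    Gaps at the start or end copy the nearest observed frame.
    """
    observed = [i for i, frame in enumerate(frames) if frame is not None]
    if not observed:
        raise ValueError("at least one visual frame must be observed")

    filled: List[Embedding] = []
    for t, frame in enumerate(frames):
        if frame is not None:
            filled.append(list(frame))
            continue
        # Nearest observed frames before and after the gap, if any.
        prev = max((i for i in observed if i < t), default=None)
        nxt = min((i for i in observed if i > t), default=None)
        if prev is None:
            filled.append(list(frames[nxt]))
        elif nxt is None:
            filled.append(list(frames[prev]))
        else:
            # Weight grows from 0 at `prev` to 1 at `nxt`.
            w = (t - prev) / (nxt - prev)
            filled.append(
                [(1 - w) * a + w * b for a, b in zip(frames[prev], frames[nxt])]
            )
    return filled


if __name__ == "__main__":
    # The middle frame is missing, e.g. because the face was occluded.
    sequence = [[0.0, 2.0], None, [1.0, 4.0]]
    print(inpaint_visual_embeddings(sequence))  # [[0.0, 2.0], [0.5, 3.0], [1.0, 4.0]]
```

A real system would replace the interpolation step with a model trained to predict the missing embeddings, but the interface is similar: a sequence with gaps goes in, and a complete sequence comes out for the extraction network to consume.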

Papers