Audio Visual Speaker Extraction
Audio-visual speaker extraction aims to isolate a target speaker's voice from a mixture of sounds by using both audio and video input, improving on audio-only methods. Current research focuses on robustness to speaker pose variations and intermittent visual cues. Techniques include pose-invariant networks and visual embedding inpainting, applied within time-domain models that combine visual cues with contextual cues such as phonetic sequences. These advances matter for speech recognition in noisy environments and for more natural, robust human-computer interaction, particularly in applications such as video conferencing and assistive listening devices.
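To make the idea of intermittent visual cues concrete, here is a minimal sketch of filling gaps in a sequence of per-frame visual embeddings before they are fused with audio features. It is an illustration only: linear interpolation is a simple stand-in for the learned inpainting that the research describes, and the names `fill_missing_embeddings` and `repeat_to_audio_rate` are hypothetical, not taken from any paper or library.

```python
from typing import List, Optional, Sequence

Embedding = List[float]


def fill_missing_embeddings(
    frames: Sequence[Optional[Embedding]],
) -> List[Embedding]:
    """Fill missing (None) visual frames.

    Assumption: gaps are filled by linear interpolation between the nearest
    valid frames; gaps at the edges copy the nearest valid frame. A learned
    inpainting model would replace this step in practice.
    """
    valid = [i for i, f in enumerate(frames) if f is not None]
    if not valid:
        raise ValueError("at least one visual frame must be available")

    filled: List[Embedding] = []
    for t, frame in enumerate(frames):
        if frame is not None:
            filled.append(list(frame))
            continue
        prev = max((i for i in valid if i < t), default=None)
        nxt = min((i for i in valid if i > t), default=None)
        if prev is None:
            filled.append(list(frames[nxt]))   # gap at the start
        elif nxt is None:
            filled.append(list(frames[prev]))  # gap at the end
        else:
            w = (t - prev) / (nxt - prev)      # position inside the gap
            filled.append(
                [(1 - w) * a + w * b for a, b in zip(frames[prev], frames[nxt])]
            )
    return filled


def repeat_to_audio_rate(frames: Sequence[Embedding], factor: int) -> List[Embedding]:
    """Repeat each video-rate embedding `factor` times to match a faster audio frame rate."""
    return [list(f) for f in frames for _ in range(factor)]


# Example: two frames lost to occlusion, one lost at the end of the clip.
visual = [[0.0, 1.0], None, None, [3.0, 4.0], None]
filled = fill_missing_embeddings(visual)
aligned = repeat_to_audio_rate(filled, factor=2)
```

After this step every time index carries a visual embedding, so a downstream extraction model could concatenate `aligned` with its audio features frame by frame. The repetition factor of 2 is arbitrary; a real system would derive it from the video frame rate and the audio encoder's stride.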