Audio-Visual Speech Separation

Audio-visual speech separation (AVSS) isolates individual voices from a mixed recording by combining the audio signal with visual information about the speakers. This typically improves on audio-only methods, particularly in noisy or multi-speaker environments. Current research focuses on models that stay robust when visual cues are missing or noisy, using techniques such as attention mechanisms and diffusion models, along with efficient architectures (e.g., transformer-based networks), so that separation is both accurate and computationally affordable. These advances matter for applications such as speech recognition, meeting transcription, and assistive technologies, where they can make speech processing more robust and accurate in real-world conditions.
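To make the idea of attention-based fusion with missing visual cues more concrete, here is a minimal NumPy sketch. It is not taken from any particular paper. The function names (`cross_modal_attention`, `estimate_mask`), the feature sizes, and the random, untrained projection matrices `W_a` and `W_o` are all assumptions chosen for illustration, so the output shows only the data flow, not real separation quality. Audio frames act as queries over per-frame visual embeddings, video frames flagged as missing are excluded from the attention weights, and the fused features are mapped to a soft mask over the mixture spectrogram.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries receive weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_modal_attention(audio_feats, visual_feats, visual_valid):
    """Audio frames (T_a, D) attend over visual frames (T_v, D).

    visual_valid is a boolean array (T_v,) marking usable video frames.
    If no frame is usable, the visual context falls back to zeros, so the
    model degrades to audio-only features (an assumed fallback policy).
    """
    if not visual_valid.any():
        return np.zeros_like(audio_feats)
    # Zero out missing frames so their content can never leak through.
    visual_feats = np.where(visual_valid[:, None], visual_feats, 0.0)
    d = audio_feats.shape[1]
    scores = audio_feats @ visual_feats.T / np.sqrt(d)      # (T_a, T_v)
    scores = np.where(visual_valid[None, :], scores, -np.inf)
    weights = softmax(scores, axis=1)                       # rows sum to 1
    return weights @ visual_feats                           # (T_a, D)

def estimate_mask(mix_mag, visual_feats, visual_valid, W_a, W_o):
    """Map a mixture magnitude spectrogram (T_a, F) to a soft mask (T_a, F)."""
    audio_feats = mix_mag @ W_a                             # (T_a, D)
    context = cross_modal_attention(audio_feats, visual_feats, visual_valid)
    fused = np.concatenate([audio_feats, context], axis=1)  # (T_a, 2D)
    logits = fused @ W_o                                    # (T_a, F)
    return 1.0 / (1.0 + np.exp(-logits))                    # sigmoid

# Toy data: all values are synthetic and the weights are untrained.
rng = np.random.default_rng(0)
T_a, T_v, F, D = 50, 12, 65, 16
mix_mag = np.abs(rng.normal(size=(T_a, F)))     # stand-in for |STFT| of the mixture
visual_feats = rng.normal(size=(T_v, D))        # stand-in for per-frame visual embeddings
visual_valid = np.ones(T_v, dtype=bool)
visual_valid[4:8] = False                       # simulate missing video frames

W_a = rng.normal(scale=0.1, size=(F, D))        # hypothetical audio projection
W_o = rng.normal(scale=0.1, size=(2 * D, F))    # hypothetical output projection

mask = estimate_mask(mix_mag, visual_feats, visual_valid, W_a, W_o)
separated_mag = mask * mix_mag                  # masked estimate of the target speaker
```

In a trained system the projections would be learned and the visual embeddings would come from a video encoder. The part that carries over is the masking of invalid frames before the softmax, which is one simple way to keep unreliable visual input from influencing the estimate.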

Papers