Audio Visual Target

Audio-visual target speaker extraction (AV-TSE) isolates one speaker's voice from a noisy audio mixture, using accompanying video of that speaker's lip movements as a cue. Current research addresses two main challenges: modality imbalance, where the audio stream tends to dominate and the visual stream is underused, and accurate extraction in real-world, noisy environments. Proposed approaches include architectures such as SepFormer and attention mechanisms that fuse audio and visual information. These advances matter for automatic speech recognition (ASR) systems and for applications in robotics, human-computer interaction, and other audio-visual technologies.
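
To make the idea of attention-based fusion concrete, here is a minimal sketch of cross-modal attention in which each audio frame queries the visual (lip) frames and is concatenated with the visual context it retrieves. It is an illustration under stated assumptions, not the method of any particular AV-TSE system: the function name cross_modal_attention is hypothetical, both streams are assumed to be already embedded in the same dimension, and the learned projection matrices that real models use are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(audio, visual):
    """Fuse audio and visual frames with scaled dot-product attention.

    audio:  list of T_a audio embeddings, each a list of d floats (queries)
    visual: list of T_v visual embeddings, each a list of d floats (keys and values)
    Returns T_a fused vectors of length 2 * d: [audio frame, visual context].
    Assumption: no learned projections; embeddings are used directly.
    """
    d = len(audio[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for query in audio:
        # Similarity between this audio frame and every visual frame.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visual]
        weights = softmax(scores)
        # Attention-weighted average of the visual frames.
        context = [
            sum(w * frame[j] for w, frame in zip(weights, visual))
            for j in range(d)
        ]
        # Concatenate the audio frame with its visual context.
        fused.append(list(query) + context)
    return fused


# Toy embeddings (made up for illustration only).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_modal_attention(audio, visual)
```

Because the audio frames act as queries here, the output keeps the audio time resolution. Concatenation is only one fusion choice; work on modality imbalance often focuses on how to keep the visual context from being ignored by later layers.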

Papers