Audio Visual Target
Audio-visual target speaker extraction (AV-TSE) isolates one person's speech from a noisy audio mixture, using accompanying video of that person's lip movements as a cue. Current research concentrates on two challenges: modality imbalance, where the audio stream often dominates the visual one, and extraction accuracy in real-world, noisy environments. Proposed approaches include architectures such as SepFormer and attention mechanisms that fuse audio and visual information. These advances matter for automatic speech recognition (ASR) systems and for applications in robotics, human-computer interaction, and other audio-visual technologies.
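To make the idea of attention-based fusion concrete, the following is a minimal NumPy sketch of one common pattern, cross-attention in which audio frames query lip-movement frames. It is an illustration under stated assumptions, not the method of SepFormer or of any particular paper: the function name `cross_attention_fuse`, the feature sizes, the frame counts (100 audio frames against 25 video frames), and the random projection matrices standing in for learned weights are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(audio, visual, w_q, w_k, w_v):
    """Fuse audio and visual features with single-head cross-attention.

    audio:  (T_a, d_a) audio-frame embeddings (used as queries)
    visual: (T_v, d_v) lip-frame embeddings (used as keys and values)
    Returns the fused features (T_a, d_a + d) and the attention weights (T_a, T_v).
    """
    q = audio @ w_q    # (T_a, d)
    k = visual @ w_k   # (T_v, d)
    v = visual @ w_v   # (T_v, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each audio frame attends over video frames
    attended = weights @ v                   # visual context per audio frame
    fused = np.concatenate([audio, attended], axis=-1)
    return fused, weights


# Hypothetical sizes: audio and video usually run at different frame rates,
# so attention also serves as a soft alignment between the two streams.
rng = np.random.default_rng(0)
T_a, T_v, d_a, d_v, d = 100, 25, 64, 32, 16
audio = rng.standard_normal((T_a, d_a))
visual = rng.standard_normal((T_v, d_v))

# Random projections stand in for weights that a real model would learn.
w_q = rng.standard_normal((d_a, d)) / np.sqrt(d_a)
w_k = rng.standard_normal((d_v, d)) / np.sqrt(d_v)
w_v = rng.standard_normal((d_v, d)) / np.sqrt(d_v)

fused, weights = cross_attention_fuse(audio, visual, w_q, w_k, w_v)
print(fused.shape, weights.shape)  # (100, 80) (100, 25)
```

In a full extraction system, the fused features would typically feed a separator network that estimates the target speaker's speech. The sketch stops at the fusion step because that is where the modality-imbalance problem arises: if the downstream layers can succeed using only the `audio` half of `fused`, the `attended` visual half may be ignored.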