Target Speech Extraction
Target speech extraction (TSE) aims to isolate a specific speaker's voice from a noisy audio mixture, mirroring the human ability to attend to one talker in a crowd (the "cocktail party effect"). Current research relies heavily on deep learning, employing architectures such as transformers, diffusion models, and neural beamformers, and typically conditions the extraction on a cue identifying the target speaker, such as visual information (e.g., lip movements) or representations from pre-trained self-supervised models, to improve accuracy and robustness. The field matters for advancing human-computer interaction, particularly in robotics and assistive technologies, and for improving speech recognition in challenging acoustic environments. Research is also actively exploring how to make TSE systems robust to variations in speaker characteristics and how to minimize false alarms.
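To make the core idea concrete, the sketch below shows a minimal speaker-conditioned mask-estimation network in PyTorch: an enrollment utterance of the target speaker is encoded into an embedding, which is concatenated with the mixture's spectrogram features to predict a time-frequency mask for the target voice. This is an illustrative toy model only; the module names (TSENet, SpeakerEncoder), layer sizes, and the simple BLSTM mask estimator are assumptions, not the architecture of any specific paper listed here.

```python
# Minimal sketch of speaker-conditioned target speech extraction (TSE).
# Illustrative only: names, sizes, and the BLSTM mask estimator are assumptions.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Maps an enrollment spectrogram to a fixed-size speaker embedding."""

    def __init__(self, n_freq: int = 257, emb_dim: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, emb_dim, batch_first=True)

    def forward(self, enroll_spec: torch.Tensor) -> torch.Tensor:
        # enroll_spec: (batch, time, freq); mean-pool the LSTM outputs over time
        out, _ = self.rnn(enroll_spec)
        return out.mean(dim=1)  # (batch, emb_dim)


class TSENet(nn.Module):
    """Estimates a time-frequency mask for the target speaker,
    conditioned on the enrollment-derived speaker embedding."""

    def __init__(self, n_freq: int = 257, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.spk_enc = SpeakerEncoder(n_freq, emb_dim)
        self.rnn = nn.LSTM(n_freq + emb_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec: torch.Tensor, enroll_spec: torch.Tensor) -> torch.Tensor:
        # mix_spec: (batch, time, freq) magnitude spectrogram of the mixture
        emb = self.spk_enc(enroll_spec)                           # (batch, emb_dim)
        emb = emb.unsqueeze(1).expand(-1, mix_spec.size(1), -1)   # broadcast over time
        h, _ = self.rnn(torch.cat([mix_spec, emb], dim=-1))
        return mix_spec * self.mask(h)                            # masked target estimate


if __name__ == "__main__":
    model = TSENet()
    mixture = torch.rand(2, 100, 257)    # noisy multi-speaker mixture
    enrollment = torch.rand(2, 50, 257)  # clean clip of the target speaker
    target_est = model(mixture, enrollment)
    print(target_est.shape)  # torch.Size([2, 100, 257])
```

The papers below follow the same conditioning principle but replace the toy speaker encoder with features from pre-trained self-supervised learning models.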
Papers
Target Speech Extraction with Pre-trained Self-supervised Learning Models
Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocky
Probing Self-supervised Learning Models with Target Speech Extraction
Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky