Target Sound Extraction
Target sound extraction (TSE) aims to isolate a desired sound from a mixture, using clues such as sound class labels, audio queries, timestamps, or natural-language descriptions. Current research relies heavily on deep learning models, including transformers, diffusion probabilistic models, and state-space models, and often incorporates pre-trained audio foundation models to improve performance and generalization. Potential applications include assistive hearing technologies, audio editing, and human-computer interaction that depends on finer-grained audio processing.
Papers
SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model
Carlos Hernandez-Olivan, Marc Delcroix, Tsubasa Ochiai, Daisuke Niizumi, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki
Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues
Dayun Choi, Jung-Woo Choi