Audio-Visual Separation

Audio-visual separation aims to isolate individual sound sources from an audio mixture by using the accompanying video as a cue, improving on traditional audio-only separation methods. Current research focuses on raising separation accuracy by incorporating spatial information, handling sounds whose sources are not visible in the video, and improving the output quality and generalization of models, using techniques such as generative diffusion models and transformer-based architectures with attention mechanisms. These advances matter for applications such as virtual and augmented reality, where they improve the realism and clarity of audio, and for building sound separation systems that stay robust across diverse environments.

Papers