Audio-Visual Source Localization

Audio-visual source localization (AVSL) aims to locate sound sources within video frames by combining audio and visual information. Current research focuses on improving the accuracy and robustness of AVSL models, particularly by addressing visual biases in datasets and the difficulty existing methods have with multiple sound sources or noisy environments. Much of this work develops semi-supervised and self-supervised learning techniques, often built on contrastive learning or teacher-student architectures, to make effective use of both labeled and unlabeled data. Advances in AVSL matter for applications such as augmented reality, assistive technologies, and human-computer interaction.
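
As a rough illustration of the contrastive idea mentioned above, the sketch below scores every spatial position of a frame against an audio embedding and trains matching audio-frame pairs to score higher than mismatched ones. It is a minimal NumPy sketch under stated assumptions, not the method of any particular paper: the function names (`localization_maps`, `contrastive_loss`, `predict_location`) are hypothetical, the embeddings are assumed to come from audio and visual encoders that are not shown, and max-pooling over the similarity map with a temperature of 0.07 is just one common choice.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def localization_maps(audio_emb, visual_feats):
    """Cosine similarity between every audio clip and every frame position.

    audio_emb:    (B, D) one embedding per audio clip (assumed encoder output)
    visual_feats: (B, H, W, D) one embedding per spatial cell of each frame
    returns:      (B, B, H, W), entry [i, j] is the map of audio i over frame j
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(visual_feats)
    return np.einsum("id,jhwd->ijhw", a, v)

def contrastive_loss(audio_emb, visual_feats, temperature=0.07):
    """InfoNCE-style loss: audio i should match frame i better than frame j != i."""
    maps = localization_maps(audio_emb, visual_feats)
    # Pool each similarity map to one score using its best-matching position.
    scores = maps.max(axis=(2, 3)) / temperature            # (B, B)
    # Numerically stable log-softmax over candidate frames for each audio clip.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal.
    return -np.mean(np.diag(log_probs))

def predict_location(audio_emb, visual_feats):
    """Return the (row, col) cell with the highest similarity for each pair i."""
    maps = localization_maps(audio_emb, visual_feats)
    b, _, h, w = maps.shape
    own = maps[np.arange(b), np.arange(b)]                   # (B, H, W)
    flat = own.reshape(b, -1).argmax(axis=1)
    return np.stack(np.unravel_index(flat, (h, w)), axis=1)  # (B, 2)
```

At inference time only the similarity map is needed: the predicted source location is the position where the map peaks, and in practice the map would be upsampled to the frame resolution before comparison with any ground-truth annotation.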

Papers