Audio-Visual Source Localization
Audio-visual source localization (AVSL) aims to pinpoint the location of sound sources within a video by integrating audio and visual information. Current research focuses on improving the accuracy and robustness of AVSL models, particularly on challenges such as visual bias in datasets and the limitations of existing methods when handling multiple sound sources or noisy environments. To leverage both labeled and unlabeled data effectively, recent work develops semi-supervised and self-supervised learning techniques, often built on contrastive learning or teacher-student architectures. Advances in AVSL have significant implications for applications such as augmented reality, assistive technologies, and human-computer interaction.
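The core mechanic shared by many of these methods is cross-modal similarity: a global audio embedding is compared against spatial visual features, and locations whose visual features align with the audio are scored as likely sound sources. The sketch below is a minimal, hypothetical illustration of that idea (the function name, array shapes, and temperature value are assumptions for the example, not any specific paper's implementation):

```python
import numpy as np

def localize(audio_emb, visual_feats, temperature=0.07):
    """Score each spatial location by audio-visual similarity.

    audio_emb:    (D,)       global audio embedding for the clip
    visual_feats: (H, W, D)  per-location visual feature map
    Returns an (H, W) heatmap that sums to 1.
    """
    # L2-normalize both modalities so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb)
    v = visual_feats / np.linalg.norm(visual_feats, axis=-1, keepdims=True)
    sim = v @ a  # (H, W) cosine-similarity map

    # Softmax over all spatial locations -> localization heatmap
    logits = sim / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy usage: plant one visual location that matches the audio direction
rng = np.random.default_rng(0)
audio = rng.normal(size=8)
feats = rng.normal(size=(4, 4, 8))
feats[2, 3] = audio * 5.0  # the "sounding" region
heatmap = localize(audio, feats)
```

In a contrastive training setup, heatmaps from matching audio-visual pairs are pushed toward high peak similarity while mismatched pairs from other clips serve as negatives; teacher-student variants instead have a teacher network produce pseudo-heatmaps that supervise a student on unlabeled videos.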
Papers
Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Yuxin Guo, Shijie Ma, Hu Su, Zhiqing Wang, Yuhao Zhao, Wei Zou, Siyang Sun, Yun Zheng
Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization
Yuxin Guo, Shijie Ma, Yuhao Zhao, Hu Su, Wei Zou