Audio-Visual Segmentation
Audio-visual segmentation (AVS) aims to identify and delineate the visual sources of sounds within video frames, generating pixel-level masks that correspond to audible objects. Current research relies heavily on transformer-based architectures, focusing on improving efficiency for real-time applications, mitigating biases inherent in the training data distribution, and strengthening the integration of audio and visual cues through techniques such as adaptive query generation and multi-modal attention. These advances are significant for applications such as video editing, augmented reality, and robotics, where an accurate understanding of audio-visual relationships is crucial. Furthermore, research is exploring weakly-supervised and even unsupervised approaches to reduce reliance on expensive pixel-level annotations.
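To make the audio-queried, attention-based design concrete, below is a minimal PyTorch sketch of one common pattern: pooled audio features generate object queries that cross-attend to visual features, and each query's embedding is dotted with per-pixel features to produce mask logits. All names and dimensions here (AudioQueriedSegmentationHead, dim=256, num_queries=8) are illustrative assumptions, not the architecture of any paper listed below.

```python
import torch
import torch.nn as nn

class AudioQueriedSegmentationHead(nn.Module):
    """Hypothetical sketch of audio-queried mask prediction for AVS.

    Audio features are projected into a set of object queries; the queries
    cross-attend over flattened visual tokens (multi-modal attention), and
    each attended query is matched against per-pixel features to score masks.
    """

    def __init__(self, dim=256, num_queries=8, num_heads=8):
        super().__init__()
        self.num_queries = num_queries
        self.dim = dim
        # Project the pooled audio embedding into `num_queries` object queries.
        self.query_proj = nn.Linear(dim, num_queries * dim)
        # Queries attend over visual tokens to gather sound-relevant evidence.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat:  (B, dim)        pooled audio embedding for the clip
        # visual_feat: (B, H*W, dim)   flattened per-pixel visual features
        B = audio_feat.shape[0]
        queries = self.query_proj(audio_feat).view(B, self.num_queries, self.dim)
        # Audio-conditioned queries attend to the visual feature map.
        attended, _ = self.cross_attn(queries, visual_feat, visual_feat)
        # Dot product between query embeddings and pixel features -> mask logits.
        mask_logits = torch.einsum(
            "bqd,bpd->bqp", self.mask_embed(attended), visual_feat
        )
        return mask_logits  # (B, num_queries, H*W); reshape to per-query masks

# Example usage with dummy tensors.
head = AudioQueriedSegmentationHead()
audio = torch.randn(2, 256)            # batch of 2 pooled audio embeddings
visual = torch.randn(2, 32 * 32, 256)  # 32x32 visual feature map, flattened
masks = head(audio, visual)            # (2, 8, 1024)
```

In this query-based formulation each audio-derived query can specialize toward a potential sounding object; the titles below (e.g., audio-queried transformers and audio queries for discovering sounding objects) suggest this is a common thread, though each paper elaborates it differently.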
Papers
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xiao
Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation
Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu