Audio-Visual Semantic Segmentation
Audio-visual semantic segmentation (AVSS) aims to identify and classify sound sources in video frames at the pixel level, combining visual and auditory cues for improved accuracy. Recent research extends AVSS to open-vocabulary scenarios, handles partially missing modalities (e.g., limited camera views), and improves training efficiency through progressive training strategies. These advances matter for applications such as augmented-reality safety systems and, more broadly, for understanding complex audio-visual scenes, pushing the boundaries of multimodal understanding in computer vision and machine learning.
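To make the task definition concrete, below is a minimal, illustrative PyTorch sketch of the core AVSS idea: fusing a clip-level audio embedding with per-pixel visual features and predicting a semantic class for every pixel. All names and dimensions (MinimalAVSSHead, visual_dim, audio_dim, the sigmoid-gated fusion) are hypothetical choices for this sketch, not the method of any particular paper.

```python
import torch
import torch.nn as nn

class MinimalAVSSHead(nn.Module):
    """Hypothetical sketch: fuse an audio embedding with visual
    feature maps and classify every pixel into a semantic category."""

    def __init__(self, visual_dim=256, audio_dim=128, num_classes=21):
        super().__init__()
        # Project the clip-level audio embedding into the visual feature space.
        self.audio_proj = nn.Linear(audio_dim, visual_dim)
        # 1x1 convolution produces per-pixel class logits over the fused features.
        self.classifier = nn.Conv2d(visual_dim, num_classes, kernel_size=1)

    def forward(self, visual_feats, audio_embed):
        # visual_feats: (B, C, H, W) from any visual backbone
        # audio_embed:  (B, A) from any audio encoder (e.g., a spectrogram network)
        a = self.audio_proj(audio_embed)          # (B, C)
        a = a.unsqueeze(-1).unsqueeze(-1)         # (B, C, 1, 1), broadcastable over H, W
        fused = visual_feats * torch.sigmoid(a)   # audio-gated visual features
        return self.classifier(fused)             # (B, num_classes, H, W)

# Usage with random tensors standing in for real encoder outputs.
head = MinimalAVSSHead()
logits = head(torch.randn(2, 256, 64, 64), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 21, 64, 64])
```

Real AVSS models replace the simple gating above with richer cross-modal interaction (e.g., cross-attention) and train on video, but the input/output contract, audio plus frames in, per-pixel class logits out, is the same.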