Paper ID: 2412.01488 • Published Dec 2, 2024
TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization
Hugo Malard, Michel Olvera, Stephane Lathuiliere, Slim Essid
Large-scale pre-trained audio and image models demonstrate an unprecedented
degree of generalization, making them suitable for a wide range of
applications. Here, we tackle the specific task of sound-prompted segmentation,
aiming to segment image regions corresponding to objects heard in an audio
signal. Most existing approaches address this problem by fine-tuning pre-trained
models or by training additional modules specifically for the task. We adopt a
different strategy: we introduce a training-free approach that leverages
Non-negative Matrix Factorization (NMF) to co-factorize audio and visual
features from pre-trained models so as to reveal shared interpretable concepts.
These concepts are then passed to an open-vocabulary segmentation model, which
produces precise segmentation maps. By using frozen pre-trained models, our method
achieves high generalization and establishes state-of-the-art performance in
unsupervised sound-prompted segmentation, significantly surpassing previous
unsupervised methods.
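
To make the co-factorization idea concrete, here is a minimal sketch of a generic shared-dictionary NMF. It is not the authors' implementation: the abstract does not state the exact objective, and the semantic constraints named in the title are not modeled here. The function name co_factorize, its parameters, and the assumption that audio and visual features are non-negative and share one embedding dimension d are all hypothetical choices made for illustration. The sketch factorizes audio features A (d x T_a) and image patch features V (d x N_v) as A ≈ W H_a and V ≈ W H_v with one shared concept matrix W, using the standard multiplicative updates for the Frobenius loss.

```python
import numpy as np


def co_factorize(audio_feats, visual_feats, n_concepts=4, n_iter=200, eps=1e-9, seed=0):
    """Joint NMF with a shared dictionary (illustrative sketch, not the paper's code).

    audio_feats:  (d, T_a) non-negative array, one column per audio frame.
    visual_feats: (d, N_v) non-negative array, one column per image patch.
    Returns W (d, k), H_a (k, T_a), H_v (k, N_v), all non-negative.
    """
    # Stacking both modalities column-wise forces them to share the basis W.
    X = np.concatenate([audio_feats, visual_feats], axis=1).astype(float)
    if (X < 0).any():
        raise ValueError("NMF requires non-negative features")

    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.random((d, n_concepts)) + eps
    H = rng.random((n_concepts, n)) + eps

    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius reconstruction error.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)

    t_a = audio_feats.shape[1]
    # Split the activations back into their audio and visual parts.
    return W, H[:, :t_a], H[:, t_a:]
```

Under these assumptions, each row of H_v can be reshaped to the patch grid to give a coarse activation map for one concept, and the rows of H_a indicate which concepts are active in the audio. How such concepts are then turned into prompts for the open-vocabulary segmentation model is not specified in the abstract and is left out of this sketch.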