Audio Visual Instance

Audio-visual instance segmentation aims to identify, segment, and track individual sound-producing objects in videos by combining audio and visual information. Current research focuses on efficient model architectures, such as Siamese networks and transformers, that can cope with the large datasets and computational demands of this multi-modal task, often incorporating contrastive learning and cross-modal fusion techniques. The field matters because it advances multi-modal understanding, with potential applications in video indexing, content analysis, and assistive technologies for people with visual or hearing impairments. New benchmark datasets are also driving progress in this rapidly evolving area.
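
To make the contrastive learning idea mentioned above more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss that pulls matching audio and visual embeddings together and pushes mismatched pairs apart. It is not taken from any particular paper in this area: the function names, the temperature of 0.07, and the assumption that row i of each array describes the same sounding object are all illustrative choices.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def audio_visual_info_nce(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, visual_emb: arrays of shape (batch, dim). Row i of each array
    is assumed (hypothetically) to come from the same sound-producing object.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(visual_emb)
    logits = a @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))                 # positives sit on the diagonal
    loss_a2v = -log_softmax(logits)[idx, idx].mean()    # audio -> visual
    loss_v2a = -log_softmax(logits.T)[idx, idx].mean()  # visual -> audio
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    visual = np.eye(4)                      # four toy, mutually orthogonal embeddings
    aligned = audio_visual_info_nce(np.eye(4), visual)
    shuffled = audio_visual_info_nce(np.roll(np.eye(4), 1, axis=0), visual)
    print(f"aligned pairs:  {aligned:.6f}")
    print(f"shuffled pairs: {shuffled:.6f}")
```

In this toy setup the loss is close to zero when audio and visual rows are correctly paired and large when the pairing is shuffled, which is the behaviour a contrastive objective is meant to reward. A real system would compute the embeddings with learned audio and visual encoders rather than fixed vectors.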

Papers