Audio-Visual Generalized Zero-Shot Learning

Audio-visual generalized zero-shot learning (GZSL) aims to classify videos drawn from both seen and unseen categories, using a model trained only on labeled examples of the seen categories and leveraging both audio and visual cues. Current research focuses on improving the alignment of audio-visual features with textual class descriptions, often through contrastive learning and cross-modal attention, in architectures that include pre-trained large multi-modal models and generative adversarial networks. The field matters because it advances the understanding of multi-modal learning and could make video classification more robust and efficient, with potential applications in video indexing, content retrieval, and assistive technologies.
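To make the alignment idea concrete, the following is a minimal sketch, not any particular paper's method. It assumes that audio, visual, and textual class embeddings already live in a shared space, fuses the two modalities with a simple element-wise mean (a stand-in for learned cross-modal attention), and predicts the class whose text embedding is most similar to the fused embedding. The function names, class names, and toy vectors are all hypothetical. It also computes the harmonic mean of seen and unseen accuracy, a metric commonly reported for GZSL.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(audio, visual):
    """Hypothetical fusion: element-wise mean of the two modality embeddings.

    Real systems typically learn this step (e.g. with cross-modal attention).
    """
    return [(a + v) / 2 for a, v in zip(audio, visual)]


def predict(audio, visual, class_embeddings):
    """Return the class whose text embedding is closest to the fused embedding.

    The candidate set contains seen AND unseen classes, which is what makes
    the setting "generalized".
    """
    query = fuse(audio, visual)
    return max(class_embeddings, key=lambda name: cosine(query, class_embeddings[name]))


def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen-class and unseen-class accuracy."""
    total = seen_acc + unseen_acc
    return 2 * seen_acc * unseen_acc / total if total else 0.0


# Toy text embeddings (assumed values, for illustration only).
class_embeddings = {
    "dog_barking": [1.0, 0.0, 0.0],     # seen during training
    "playing_guitar": [0.0, 1.0, 0.0],  # seen during training
    "thunderstorm": [0.0, 0.0, 1.0],    # unseen during training
}

# A test video whose audio and visual embeddings both point toward the
# third axis, so the unseen class is the nearest neighbour.
label = predict([0.1, 0.0, 0.9], [0.0, 0.2, 0.8], class_embeddings)
score = harmonic_mean(0.6, 0.3)
```

The harmonic mean is low whenever either accuracy is low, so a model that simply favours seen classes cannot score well on it.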

Papers