Audio-Visual Video Parsing

Audio-visual video parsing (AVVP) aims to automatically identify and temporally localize events within videos using both audio and visual information. The task is challenging because multiple events can overlap in time and across modalities, and training data is typically only weakly supervised, with labels available at the video level rather than per segment or per modality. Current research focuses on improving the decoding phase of models, employing techniques such as label-semantic projection and pseudo-labeling, often within transformer-based architectures, to improve event classification and localization accuracy. These advances are significant for video understanding in general and have applications in areas such as video indexing, content summarization, and assistive technologies for visually or hearing impaired users.
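
To make the weakly supervised setup concrete, the minimal PyTorch sketch below scores each audio and visual segment against a set of event classes and pools those scores into video-level predictions, so training needs only video-level labels while segment-level scores can still be read out at inference for temporal localization. All module names, feature dimensions, and the simple attention pooling here are illustrative assumptions, not the method of any particular paper.

```python
import torch
import torch.nn as nn

class WeakAVVP(nn.Module):
    """Segment-level event scoring with attention (MIL-style) pooling to video level."""

    def __init__(self, audio_dim=128, visual_dim=512, d_model=256, num_classes=25):
        super().__init__()
        # Project precomputed audio/visual segment features to a shared width.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # temporal/cross-segment context
        self.cls_head = nn.Linear(d_model, num_classes)  # per-segment event logits
        self.att_head = nn.Linear(d_model, num_classes)  # per-segment attention logits

    def forward(self, audio, visual):
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim); T = number of segments.
        # Audio and visual segments are concatenated along time, so each keeps
        # its own per-segment score (modality-aware parsing output).
        seg = torch.cat([self.audio_proj(audio), self.visual_proj(visual)], dim=1)
        seg = self.encoder(seg)                             # (B, 2T, d_model)
        seg_logits = self.cls_head(seg)                     # (B, 2T, C) segment scores
        att = torch.softmax(self.att_head(seg), dim=1)      # attention over segments, per class
        video_logits = (att * seg_logits).sum(dim=1)        # (B, C) video-level scores
        return video_logits, torch.sigmoid(seg_logits)      # weak training / parsing readout


# Training uses only video-level multi-label targets (the weakly supervised setting).
model = WeakAVVP()
audio, visual = torch.randn(2, 10, 128), torch.randn(2, 10, 512)
video_labels = torch.randint(0, 2, (2, 25)).float()
video_logits, segment_probs = model(audio, visual)
loss = nn.BCEWithLogitsLoss()(video_logits, video_labels)
loss.backward()
```

The segment-level probabilities returned alongside the video-level logits are what a decoder would threshold or refine (for example with pseudo-labels) to produce the final temporally localized, modality-specific event predictions.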

Papers