Audio-Visual Video Parsing
Audio-visual video parsing (AVVP) aims to automatically identify events in a video and locate them in time using both audio and visual information. The task is challenging because events often overlap and because training data is usually weakly supervised: only video-level labels are available, with no annotation of when each event occurs. Current research focuses on improving the decoding phase of models, using techniques such as label-semantic projection and pseudo-labeling to improve event classification and localization accuracy, often within transformer-based architectures. These advances improve video understanding and have applications in video indexing, content summarization, and assistive technologies for people with visual or hearing impairments.
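The weakly supervised setup can be made concrete with a small sketch. It assumes a model that outputs per-segment event probabilities for one modality, pools them into a video-level prediction that can be compared with the video-level labels during training, and thresholds the per-segment scores at inference to recover event spans. Mean pooling, the 0.5 threshold, the two event classes, and all function names here are illustrative assumptions, not the method of any particular paper.

```python
import math

def video_level_probs(segment_probs):
    """Pool per-segment class probabilities into one video-level vector.

    Assumption: plain mean pooling over time; published systems often use
    learned (for example attention-based) pooling instead.
    """
    num_segments = len(segment_probs)
    num_classes = len(segment_probs[0])
    return [sum(seg[c] for seg in segment_probs) / num_segments
            for c in range(num_classes)]

def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between pooled probabilities and video-level labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def localize(segment_probs, class_idx, threshold=0.5):
    """Return (start, end) segment-index spans where the class is active; end is exclusive."""
    spans = []
    start = None
    for t, seg in enumerate(segment_probs):
        active = seg[class_idx] >= threshold
        if active and start is None:
            start = t
        elif not active and start is not None:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, len(segment_probs)))
    return spans

# Hypothetical audio-branch output: 4 one-second segments, 2 event classes
# (say class 0 = "speech", class 1 = "dog barking"); the events overlap in segment 1.
audio_probs = [[0.9, 0.1], [0.8, 0.6], [0.2, 0.7], [0.1, 0.9]]
video_labels = [1, 1]  # only video-level labels are available for training

pooled = video_level_probs(audio_probs)
loss = bce_loss(pooled, video_labels)
speech_spans = localize(audio_probs, 0)
dog_spans = localize(audio_probs, 1)
```

Here the loss only ever sees the pooled, video-level prediction, which is what makes the supervision weak. The span extraction works on the per-segment scores and allows two events to be active in the same segment.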