Open-Vocabulary Action Recognition

Open-vocabulary action recognition (OVAR) aims to recognize actions in video, including action categories that never appeared in the training data, by building on vision-language models such as CLIP. Current research focuses on making these models robust to noisy or ambiguous action descriptions, improving cross-domain generalization, and exploring techniques such as residual feature distillation and multi-modal prompting to improve accuracy. These advances matter because they support more versatile and adaptable video understanding systems, with applications in video retrieval, automated surveillance, and human-computer interaction.
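
The zero-shot matching step that CLIP-style approaches rely on can be sketched as follows. This is a minimal illustration, not the method of any particular paper: it assumes frame and text embeddings have already been produced by some vision-language encoder, and the toy 3-dimensional vectors, the function names (`classify_video`, `mean_pool`), and the temperature value are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frames):
    """Average per-frame embeddings into a single video-level embedding."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_video(frame_embeddings, text_embeddings, labels, temperature=0.01):
    """Pick the label whose text embedding is closest to the pooled video embedding.

    The label set is only consulted at inference time, so new action names
    can be added without retraining anything in this sketch.
    """
    video = l2_normalize(mean_pool([l2_normalize(f) for f in frame_embeddings]))
    sims = [
        sum(a * b for a, b in zip(video, l2_normalize(t)))
        for t in text_embeddings
    ]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy stand-ins for encoder outputs (assumed values, not real CLIP features).
labels = ["archery", "juggling", "surfing"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_embeddings = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]

label, probs = classify_video(frame_embeddings, text_embeddings, labels)
print(label)
```

Because classification reduces to comparing a video embedding against text embeddings, extending the vocabulary only requires embedding one more action description, which is what makes the approach "open-vocabulary".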

Papers