Zero-Shot Temporal Action Localization

Zero-shot temporal action localization (ZS-TAL) aims to detect and temporally localize actions in untrimmed videos without training on those specific action classes, leveraging pre-trained vision-language models (VLMs) to bridge visual and textual representations. Current research focuses on improving the completeness and accuracy of action proposals, often employing transformer-based architectures and incorporating techniques such as contrastive learning and test-time adaptation to improve generalization to unseen actions. The field matters because it reduces reliance on extensive labeled datasets, enabling more efficient and scalable video understanding systems with applications in video retrieval, action recognition, and other areas requiring robust video analysis.
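As a rough illustration of the VLM-matching idea described above, the sketch below scores per-segment video embeddings against text embeddings of action prompts by cosine similarity, then groups consecutive segments that confidently match the same action into temporal proposals. The arrays and the threshold are hypothetical stand-ins for real CLIP-style encoder outputs, not any specific paper's method.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Unit-normalize vectors so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_action_scores(segment_feats, text_feats):
    """Cosine similarity between each video segment and each action prompt.

    segment_feats: (num_segments, dim) visual embeddings (hypothetical).
    text_feats:    (num_actions, dim) prompt embeddings (hypothetical).
    Returns a (num_segments, num_actions) score matrix.
    """
    return l2_normalize(segment_feats) @ l2_normalize(text_feats).T

def group_proposals(scores, threshold=0.5):
    """Greedily merge consecutive segments whose best-matching action agrees.

    Emits (start_segment, end_segment, action_index) tuples; end is exclusive.
    """
    labels = scores.argmax(axis=1)
    conf = scores.max(axis=1)
    proposals, start = [], None
    for i in range(len(labels)):
        keep = conf[i] >= threshold
        if start is None:
            if keep:
                start = i
        elif not keep or labels[i] != labels[start]:
            proposals.append((start, i, int(labels[start])))
            start = i if keep else None
    if start is not None:
        proposals.append((start, len(labels), int(labels[start])))
    return proposals

# Toy 2-D embeddings: first two segments match action 0, last two action 1.
text = np.array([[1.0, 0.0], [0.0, 1.0]])
segs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]])
proposals = group_proposals(zero_shot_action_scores(segs, text))
# → [(0, 2, 0), (2, 4, 1)]
```

In practice the segment and prompt features would come from a frozen VLM (e.g., video frames and prompts like "a video of diving"), and the grouping step is where much of the surveyed work differs, replacing this greedy merge with learned proposal generation.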

Papers