Supervised Temporal Action Localization
Supervised temporal action localization (TAL) aims to identify actions in untrimmed videos and pinpoint their start and end times, using labeled training data. Much recent research addresses the challenges of weak supervision (e.g., when only video-level labels are available), employing techniques such as pseudo-label learning, contrastive learning, and graph neural networks to improve localization accuracy. These approaches draw on both visual and textual information and often incorporate transformer architectures and attention mechanisms to capture temporal dependencies and context within videos. Better TAL algorithms benefit video understanding applications such as video retrieval, summarization, and event analysis.
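To make the task concrete, the following is a minimal sketch of what "pinpointing start and end times" can look like in practice. It assumes a model has already produced one action score per video snippet, and it simply groups consecutive above-threshold snippets into segments. The function names, the 0.5 threshold, and the `snippets_per_second` parameter are illustrative assumptions, not part of any specific TAL method. The second helper computes temporal intersection-over-union, a common way to compare a predicted segment with a labeled one.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def scores_to_segments(
    scores: List[float],
    threshold: float = 0.5,            # assumed cutoff, chosen for illustration
    snippets_per_second: float = 1.0,  # assumed snippet rate
) -> List[Segment]:
    """Group consecutive snippets scoring >= threshold into time segments."""
    segments: List[Segment] = []
    start = None  # index of the first snippet in the currently open segment
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # an action segment begins here
        elif score < threshold and start is not None:
            # the segment ends just before snippet i
            segments.append((start / snippets_per_second, i / snippets_per_second))
            start = None
    if start is not None:
        # the action runs until the end of the video
        segments.append((start / snippets_per_second, len(scores) / snippets_per_second))
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Overlap duration divided by the duration of the union of two segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical per-snippet scores for one action class.
    scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9]
    predicted = scores_to_segments(scores)
    print(predicted)
    print(temporal_iou(predicted[0], (3.0, 6.0)))
```

Published methods typically replace the fixed threshold with learned boundary regression or proposal ranking; the sketch only illustrates the input and output shape of the task.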