Frame Level Pseudo Label
Frame-level pseudo-labeling is a technique used in weakly supervised and unsupervised learning to compensate for the scarcity of fully annotated data in computer vision and audio processing. Instead of relying on human annotators to label every frame, a model or heuristic assigns a provisional label to each video or audio frame, and those labels then serve as training targets. Current research focuses on generating reliable frame-level pseudo-labels from sources such as prototypical distributions, pretrained models (for example, the self-supervised DINO and the vision-language model CLIP), and optical flow, and the resulting labels are often used within transformer-based architectures or self-training frameworks. The approach reduces the need for expensive manual annotation in tasks such as video instance segmentation, sound event detection, and action recognition.
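As a rough illustration of the self-training idea mentioned above, the sketch below shows one common way frame-level pseudo-labels can be derived when only a clip-level (weak) label is available: keep the most probable class per frame, restrict it to classes known to occur in the clip, and discard frames whose confidence falls below a threshold. The function name `frame_pseudo_labels`, the threshold of 0.7, and the example probabilities are all hypothetical and chosen for illustration; they are not taken from any specific published method.

```python
from typing import List, Optional


def frame_pseudo_labels(
    frame_probs: List[List[float]],
    clip_labels: List[int],
    threshold: float = 0.7,  # assumed value, tune per task
) -> List[Optional[int]]:
    """Return one class index per frame, or None for frames to ignore."""
    allowed = set(clip_labels)
    labels: List[Optional[int]] = []
    for probs in frame_probs:
        # Only classes present in the clip-level (weak) label are candidates.
        candidates = [c for c in range(len(probs)) if c in allowed]
        if not candidates:
            labels.append(None)
            continue
        best = max(candidates, key=lambda c: probs[c])
        # Low-confidence frames stay unlabeled rather than risk a wrong target.
        labels.append(best if probs[best] >= threshold else None)
    return labels


# Hypothetical per-frame class probabilities from a teacher model (3 classes).
example_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.05, 0.15, 0.80],
]
# The clip-level label says only classes 0 and 1 occur somewhere in the clip.
example_labels = frame_pseudo_labels(example_probs, clip_labels=[0, 1])
```

In a self-training loop, frames labeled `None` would typically be excluded from the loss, and the student model trained on the remaining frames could later take over as the teacher for another round of labeling.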