Masked Prediction

Masked prediction is a self-supervised learning technique in which parts of an input (e.g., words in text, pixels in images, or segments in audio) are hidden, and a model is trained to reconstruct the missing content from the surrounding context. Current research applies the technique across diverse modalities, including text, images, video, audio, and 3D point clouds, typically with transformer-based architectures, and explores variations such as multi-resolution processing and conditioning on additional information (e.g., character attributes or geometric shapes). Masked pre-training has proven effective for large models across a range of downstream tasks, improving efficiency and performance, particularly when labeled data is scarce, and it also offers insights into model explainability and robustness.
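To make the core mechanic concrete, here is a minimal sketch of BERT-style token masking, the text instance of masked prediction: a fraction of positions is selected, the input is corrupted at those positions, and training targets are kept only there. The function name, the 80/10/10 corruption split, and the `-100` ignore label follow common convention but are illustrative assumptions, not a specific paper's recipe.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for masked prediction (illustrative sketch).

    Returns (corrupted_ids, labels): the model is trained to predict the
    original token only where labels != -100.
    """
    rng = rng or np.random.default_rng(0)
    token_ids = np.asarray(token_ids)
    # Select ~mask_prob of positions as prediction targets.
    selected = rng.random(token_ids.shape) < mask_prob
    selected[0] = True  # toy safeguard: ensure at least one target position
    corrupted = token_ids.copy()
    # Conventional 80/10/10 split: [MASK] token, random token, or unchanged.
    r = rng.random(token_ids.shape)
    corrupted[selected & (r < 0.8)] = mask_id
    corrupted[selected & (r >= 0.8) & (r < 0.9)] = rng.integers(0, vocab_size)
    # Labels hold the original ids at target positions, -100 (ignored) elsewhere.
    labels = np.where(selected, token_ids, -100)
    return corrupted, labels
```

The same recipe generalizes to other modalities by swapping the unit being hidden: image patches or point-cloud segments replace tokens, and the prediction head regresses pixels or features instead of a vocabulary distribution.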

Papers