Video Representation Learning with Joint-Embedding Predictive Architectures [2412.10925]