Distilling Vision-Language Models on Millions of Videos [2401.06129]