Learning text-to-video retrieval from image captioning [2404.17498]