Learning Audio-Video Modalities from Image Captions [2204.00679]