Lightweight Cross-Modal Representation Learning [2403.04650]