Paper ID: 2307.14750

Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

Zhiyuan Li, Dongnan Liu, Heng Wang, Chaoyi Zhang, Weidong Cai

Recently, training an image captioner without annotated image-sentence pairs has gained traction. Previous methods have faced limitations due to either using mismatched corpora for inaccurate pseudo annotations or relying on resource-intensive pre-training. To alleviate these challenges, we propose a new strategy where the prior knowledge from large pre-trained models (LPMs) is distilled and leveraged as supervision, and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which can efficiently retrieve highly relevant short region descriptions from the mismatching corpora and use them to generate a variety of high-quality pseudo sentences via LPMs. Additionally, we introduce a fluency filter and a CLIP guidance objective to enhance contrastive information learning. Experimental results indicate that our method outperforms SOTA captioning models across various settings including zero-shot, unsupervised, semi-supervised, and cross-domain scenarios. Code is available at: this https URL

Submitted: Jul 27, 2023