Text Supervision

Text supervision leverages textual information, such as captions or clinical reports, to guide the training of computer vision models, particularly in scenarios where labeled image data is limited or expensive to obtain. Current research focuses on integrating text supervision into vision-language models (VLMs) such as CLIP, using techniques like prompt learning, knowledge distillation, and contrastive learning to improve performance on tasks such as image classification, segmentation, and object detection. This approach offers a cost-effective way to improve model accuracy and generalization, and is especially beneficial in domains like medical imaging and open-vocabulary recognition, where annotated data is scarce.
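The contrastive learning mentioned above is the core mechanism behind CLIP-style text supervision: paired image and text embeddings are pulled together while mismatched pairs are pushed apart via a symmetric cross-entropy over a similarity matrix. The sketch below illustrates this loss on toy embedding vectors in pure Python; the function name, the fixed temperature value, and the hand-built embeddings are illustrative assumptions (in CLIP the temperature is a learned parameter and the embeddings come from image and text encoders).

```python
import math

def normalize(v):
    """L2-normalize a vector so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def clip_style_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over an image-text similarity matrix.

    Illustrative sketch: the i-th image and i-th text are assumed to be
    a matched pair, so the correct "class" for each row is the diagonal.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    n = len(logits)
    # Image-to-text direction: each image should pick its paired caption.
    loss_i2t = -sum(math.log(softmax(logits[k])[k]) for k in range(n)) / n
    # Text-to-image direction: transpose the logits and repeat.
    logits_t = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(math.log(softmax(logits_t[k])[k]) for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

With correctly paired toy embeddings the loss is near zero, while deliberately mismatching the pairs drives it up, which is what lets caption-level supervision stand in for per-image class labels.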

Papers