Open Vocabulary Segmentation

Open-vocabulary segmentation aims to segment images into regions corresponding to arbitrary object categories specified by text, without requiring labeled training data for each category. Current research relies heavily on vision-language models such as CLIP, often adding adapters or other modifications to improve their spatial reasoning and generalization, sometimes in combination with other foundation models like SAM. The field matters because it reduces reliance on large, manually annotated datasets, enabling more efficient and flexible image understanding for applications ranging from remote sensing to autonomous driving. Training-free and weakly-supervised methods are a key focus, aiming to further reduce annotation costs and improve scalability.
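
The core mechanism shared by most CLIP-based approaches can be sketched in a few lines: embed the category names with a text encoder, extract dense per-pixel image features, and assign each pixel the label whose text embedding is most similar. The sketch below uses plain NumPy with random arrays standing in for encoder outputs; the feature shapes and the idea of dense CLIP-like features are assumptions for illustration, not any specific model's API.

```python
import numpy as np

def open_vocab_segment(pixel_feats, text_feats):
    """Label each pixel with its most similar text embedding.

    pixel_feats: (H, W, D) dense image features (e.g. from a CLIP-like
        vision encoder -- an assumed stand-in, not a real model call).
    text_feats: (C, D) embeddings of the C category prompts.
    Returns an (H, W) integer label map with values in [0, C).
    """
    # L2-normalise both sides so the dot product is cosine similarity.
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sims = p @ t.T           # (H, W, C): similarity to every category
    return sims.argmax(-1)   # per-pixel index of the best-matching prompt

# Toy demo: random arrays play the role of encoder outputs.
rng = np.random.default_rng(0)
labels = open_vocab_segment(rng.normal(size=(4, 4, 8)),   # "image" features
                            rng.normal(size=(3, 8)))      # 3 "category" prompts
```

In practice the hard part is obtaining good dense features, since CLIP is trained with image-level supervision; this is precisely what the adapters, SAM-based mask proposals, and training-free modifications in the surveyed papers address.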

Papers