Open Vocabulary Video Instance Segmentation

Open-vocabulary video instance segmentation (OV-VIS) aims to detect, segment, and track object instances in videos, including objects from categories unseen during model training. Current research focuses on adapting vision-language models such as CLIP, and on transformer-based architectures, to handle open-vocabulary classification and temporal consistency across frames. These advances improve the generalization of video instance segmentation models, paving the way for more robust and versatile applications in areas such as autonomous driving and video understanding. Large-scale benchmark datasets are also crucial for driving progress in this rapidly evolving field.
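The CLIP-style open-vocabulary classification mentioned above typically works by comparing each instance's visual embedding against text embeddings of arbitrary category names, so the class list can be extended at inference time. A minimal NumPy sketch of that matching step, using toy placeholder vectors in place of real CLIP image and text features (the function name and the embeddings are illustrative assumptions, not any specific paper's API):

```python
import numpy as np

def normalize(x):
    """L2-normalize feature vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def open_vocab_classify(instance_feats, class_text_feats, temperature=0.01):
    """Assign each instance embedding to its closest class-name embedding.

    instance_feats: (N, D) per-instance visual embeddings (e.g. pooled mask features).
    class_text_feats: (C, D) text embeddings of category names; because classes are
    defined by text alone, categories unseen during training can be added freely.
    """
    # Cosine similarity between every instance and every class prompt: (N, C)
    sims = normalize(instance_feats) @ normalize(class_text_feats).T
    logits = sims / temperature
    # Numerically stable softmax over classes
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs

# Toy example: 3 hypothetical category-name embeddings, 2 instance embeddings.
class_text_feats = np.eye(3)                 # stand-ins for CLIP text features
instance_feats = np.array([[0.9, 0.1, 0.0],  # closest to class 0
                           [0.0, 0.2, 0.8]]) # closest to class 2
labels, probs = open_vocab_classify(instance_feats, class_text_feats)
```

In a full OV-VIS pipeline, the same text embeddings are reused across frames, and per-frame predictions are linked by a tracking module to keep instance identities temporally consistent.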

Papers