Open Vocabulary Object Detection
Open-vocabulary object detection (OVOD) aims to enable computer vision systems to identify objects using textual descriptions, even if those objects weren't seen during training. Current research focuses on improving the accuracy and efficiency of OVOD, often leveraging vision-language models (like CLIP) and transformer-based architectures (like DETR) to bridge the gap between visual and textual representations, and addressing challenges like fine-grained attribute recognition and robustness to distribution shifts. The advancements in OVOD have significant implications for various applications, including autonomous driving, robotics, and remote sensing, by enabling more flexible and adaptable object recognition capabilities.
Papers
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy
DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection
Shilin Xu, Xiangtai Li, Size Wu, Wenwei Zhang, Yunhai Tong, Chen Change Loy