3D Visual Grounding

3D visual grounding aims to locate objects in 3D scenes from natural language descriptions, connecting language understanding with 3D perception. Current research focuses on improving model accuracy and efficiency through techniques such as dual-branch decoding, active retraining with pseudo-labels, and the use of large language models for query interpretation and data-efficient training. These advances matter for building robust vision-language systems in robotics and other applications that require precise object localization in complex 3D environments, particularly when labeled data is limited. The field is also addressing open challenges such as handling complex linguistic structures (e.g., determiners) and generalizing across datasets.

Papers