3D Visual Grounding

3D visual grounding aims to locate objects in 3D scenes from natural language descriptions, connecting language understanding with 3D perception. Current research targets both accuracy and efficiency, using techniques such as dual-branch decoding, active retraining with pseudo-labels, and large language models for query interpretation and data-efficient training. These advances matter for building robust vision-language systems in robotics and other applications that need precise object localization in complex 3D environments, particularly when labeled data is limited. The field is also addressing open challenges such as complex linguistic structures (e.g., determiners) and cross-dataset generalization.
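
As a rough illustration of what "precise object localization" can mean in practice, a grounding prediction is often represented as a 3D bounding box and compared with an annotated box by intersection over union (IoU). The sketch below assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax). The names `iou_3d` and `is_grounded` and the default threshold of 0.25 are illustrative assumptions, not taken from any paper listed on this page.

```python
from typing import Sequence

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])          # start of the overlap on this axis
        hi = min(a[i + 3], b[i + 3])  # end of the overlap on this axis
        if hi <= lo:
            return 0.0                # no overlap on this axis, so none at all
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def is_grounded(pred: Box, gt: Box, threshold: float = 0.25) -> bool:
    """Count a prediction as correct when its IoU reaches the threshold.

    The 0.25 default is an illustrative choice for this sketch.
    """
    return iou_3d(pred, gt) >= threshold


if __name__ == "__main__":
    predicted = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    annotated = (0.0, 0.0, 0.0, 2.0, 2.0, 1.0)
    print(iou_3d(predicted, annotated))
    print(is_grounded(predicted, annotated))
```

Oriented boxes, which some datasets use, would need a different overlap computation; this sketch covers only the axis-aligned case.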

11 papers

Papers

March 30, 2025

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
Vision Language Model Contrastive Reasoner 3D Visual Grounding Language Navigation Complex Reasoning Visual Grounding

March 8, 2025

Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning
Multimodal Alignment Study Feature Chain of Thought 3D Content 3D Vision Language 3D Visual Grounding Multimodal Reasoning

January 10, 2025

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
Visual Grounding Fine Grained LangId Magic Spell 3D Visual Grounding Image Feature Multimodal Large Language Model Multi Image

January 2, 2025

ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
Visual Grounding 3D Visual Grounding Diverse Datasets Language Grounding

July 19, 2024

PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding
Visual Grounding 3D Visual Grounding Attention Map Positional Encoding Parallel Decoding Visual Language Model

July 3, 2024

ACTRESS: Active Retraining for Semi-supervised Visual Grounding
Cross Pseudo Supervision Supervised Attention Visual Grounding Detection Confidence Actor Loss 3D Visual Grounding

March 25, 2024

Data-Efficient 3D Visual Grounding via Order-Aware Referring
3D Visual Grounding Visual Grounding 3D Datasets Visual Semantic

March 13, 2024

SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention
Cross Modal Representation 3D Visual Grounding Relational Learning Visual Grounding Relation Mapping Cross Modal Alignment

January 17, 2024

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
3D Vision Language Language Grounding 3D Visual Grounding Scene Understanding

October 10, 2023

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
Visual Grounding 3D Visual Grounding 3D Datasets

September 21, 2023

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent
Visual Grounding Large Language Model 3D Visual Grounding Agent Smith 3D Vision Language

September 7, 2023

DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners
3D Visual Grounding Natural Language Visual Grounding Complexity Level Diagnostic Dataset Higher Quality Reference

July 18, 2023

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
3D Visual Grounding Semantic Matching Sentence Pair Visual Grounding

May 23, 2023

Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
Visual Grounding 3D Visual Grounding RGB D Image

September 29, 2022

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Visual Grounding 3D Visual Grounding Semantic Loss Dense Alignment EDA Developer
