3D Vision-Language
3D vision-language research aims to enable large language models (LLMs) to understand and interact with 3D environments through natural language. Current work focuses on efficient model architectures, often building on 2D vision-language models as a foundation and using techniques such as contrastive learning, prompt tuning, and data augmentation (including synthetic data) to overcome the scarcity of large-scale 3D datasets. The field matters because it bridges the gap between LLMs and the physical world, paving the way for advances in robotics, augmented reality, and other applications that require embodied AI. Developing unified models that can handle diverse 3D representations and perform a wide range of tasks remains a key direction of ongoing investigation.
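As a rough illustration of the contrastive-learning approach mentioned above, the PyTorch sketch below aligns features from a small trainable 3D point-cloud encoder with caption embeddings that would, in practice, come from the frozen text tower of a 2D vision-language model such as CLIP. The `PointCloudEncoder` and `contrastive_alignment_loss` names, the layer sizes, and the random stand-in tensors are illustrative assumptions, not the method of any specific paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointCloudEncoder(nn.Module):
    """Minimal PointNet-style encoder: per-point MLP followed by max pooling."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3) -> (batch, embed_dim)
        return self.mlp(points).max(dim=1).values


def contrastive_alignment_loss(
    pc_embed: torch.Tensor,    # 3D features from the trainable encoder
    text_embed: torch.Tensor,  # caption features from a frozen 2D VLM text tower
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling matched 3D/text pairs together."""
    pc_embed = F.normalize(pc_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    logits = pc_embed @ text_embed.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    encoder = PointCloudEncoder()
    points = torch.randn(8, 1024, 3)   # batch of synthetic point clouds
    text_embed = torch.randn(8, 512)   # stand-in for frozen CLIP text features
    loss = contrastive_alignment_loss(encoder(points), text_embed)
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

In this setup only the 3D encoder receives gradients; keeping the 2D vision-language model frozen is one common way such methods reuse 2D pretraining while compensating for the limited size of 3D datasets.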