2-Dimensional Vision-Language Models
Two-dimensional vision-language models (VLMs) bridge visual and textual information, enabling systems to understand images and generate descriptions of them. Current research focuses on leveraging these 2D VLMs to improve 3D scene understanding, often through techniques such as feature distillation and cross-modal self-training, with architectures like CLIP playing a prominent role. This line of work matters because it extends AI systems' ability to interact with the physical world through language, with applications in robotics, 3D scene segmentation, and other tasks requiring multimodal understanding. A minimal sketch of the feature-distillation idea follows below.
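The sketch below illustrates, under stated assumptions, how frozen 2D CLIP image features can serve as a distillation target for a 3D model: the checkpoint name is the public "openai/clip-vit-base-patch32" release on Hugging Face, and the per-point "student" features are a random stand-in for whatever 3D backbone a given paper actually uses. It is an illustration of the general technique, not any specific paper's method.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen 2D teacher: a publicly available CLIP checkpoint (assumption: this
# particular checkpoint; papers may use larger or fine-tuned variants).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# A rendered or captured 2D view of the 3D scene (hypothetical file path).
image = Image.open("scene_view.png")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)  # shape (1, 512)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Stand-in for the 3D student's output (e.g., pooled per-point features
# projected to CLIP's embedding width); in practice this comes from a
# point-cloud or voxel backbone, not random noise.
student_features = torch.randn(1, 512)
student_features = student_features / student_features.norm(dim=-1, keepdim=True)

# Distillation objective: pull the student embedding toward the frozen
# CLIP embedding (cosine distance here; L2 is also common).
distill_loss = 1.0 - torch.nn.functional.cosine_similarity(
    student_features, image_features
).mean()
print(f"distillation loss: {distill_loss.item():.4f}")
```

Because the CLIP teacher stays frozen, the 3D student inherits CLIP's open-vocabulary semantics, which is what enables language-driven 3D segmentation and retrieval in the works listed below.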
Papers
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, Hengshuang Zhao
Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability
Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Lu Cheng, Mengnan Du