2-Dimensional Vision-Language Models
Two-dimensional vision-language models (VLMs) bridge visual and textual information, enabling systems to understand images and generate descriptions of them. Current research focuses on leveraging these 2D VLMs to improve 3D scene understanding, often through techniques such as feature distillation and cross-modal self-training, with architectures like CLIP playing a prominent role. This line of work matters because it helps AI systems ground language in the physical world, with applications in robotics, 3D scene segmentation, and other tasks that require multimodal understanding.
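As a concrete illustration of the kind of 2D VLM this summary refers to, the sketch below uses CLIP (via the Hugging Face transformers API) to embed an image and a few open-vocabulary text queries in a shared space and score their similarity; in distillation-style pipelines, such image features typically become regression targets for a 3D network. The model name, file path, and labels are illustrative assumptions, not drawn from the papers listed here.

```python
# Minimal sketch: open-vocabulary image-text matching with CLIP.
# Assumes the openai/clip-vit-base-patch32 checkpoint and a local "scene.jpg";
# both are placeholders, not references to any specific paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")                 # any RGB image of a scene
labels = ["a chair", "a table", "a sofa"]       # open-vocabulary text queries

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores; softmax turns them
# into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print({label: float(p) for label, p in zip(labels, probs[0])})
```

In a feature-distillation setup, the same model's per-image (or per-pixel) embeddings would be extracted offline and a 3D point network trained to reproduce them, so that the text encoder can later query the 3D features directly.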