3D Understanding

3D understanding aims to enable computers to perceive and interpret three-dimensional scenes and objects in a way that mirrors human spatial reasoning. Current research focuses on robust models that integrate multiple data modalities (point clouds, images, text, and even audio), using techniques such as multi-modal mixing, contrastive learning, and large language models (LLMs) to improve accuracy and efficiency. The field underpins advances in robotics, autonomous driving, augmented reality, and other applications that require sophisticated scene understanding. Recent work also highlights the importance of data efficiency and explainability in model development.
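As a rough illustration of the contrastive learning mentioned above, the sketch below computes a symmetric InfoNCE-style loss between a batch of point-cloud embeddings and a batch of paired text embeddings. It is a generic textbook formulation, not the method of any paper listed here. The function names, the temperature value of 0.07, and the assumption that row i of each array forms a matched pair are all illustrative choices.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(point_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    Assumption (hypothetical setup): point_emb[i] and text_emb[i] describe
    the same object; every other pairing in the batch is a negative.
    """
    p = l2_normalize(np.asarray(point_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = p @ t.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(p))
    # Point-to-text direction: each row should peak on its diagonal entry.
    loss_p2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-point direction: each column should peak on its diagonal entry.
    loss_t2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2t + loss_t2p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(8, 16))
    aligned_text = points + 0.01 * rng.normal(size=(8, 16))
    shuffled_text = np.roll(aligned_text, shift=1, axis=0)
    print("aligned pairs:   ", symmetric_contrastive_loss(points, aligned_text))
    print("mismatched pairs:", symmetric_contrastive_loss(points, shuffled_text))
```

In this toy setup the loss is lower when matching rows really are paired than when the text rows are shifted out of alignment. That is the signal a training loop would use to pull the modalities into a shared embedding space.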

Papers

September 1, 2023

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
Large Language Model Language Model Point Cloud Multi Modality Multi Modal Model 3D Understanding Multi Modal Instruction

July 28, 2023

Point Clouds Are Specialized Images: A Knowledge Transfer Approach for 3D Understanding
Jiachen Kang, Wenjing Jia, Xiangjian He, Kin Man Lam
Point Cloud Pre Trained Knowledge Transfer Self Supervised Representation Learning Robust Representation Data Scarcity High Quality Image Point Cloud Understanding 3D Understanding

May 14, 2023

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese
Multi Modal Large Multimodal Model Multimodal Pre 3D Understanding Multi Modal Pre Training Multi Modal 3D

April 12, 2023

CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
Maria Parelli, Alexandros Delitzas, Nikolas Hars, Georgios Vlassis, Sotirios Anagnostidis, Gregor Bachmann, Thomas Hofmann
Question Answering Vision Language 3D Scene Yes No Question 3D Understanding 3D Reasoning Scene Encoder

December 27, 2022

MVTN: Learning Multi-View Transformations for 3D Understanding
Abdullah Hamdi, Faisal AlZahrani, Silvio Giancola, Bernard Ghanem
Multi View Left Corner Transformation 3D Understanding View Projection Multi View Network

December 22, 2022

Monocular 3D Object Detection using Multi-Stage Approaches with Attention and Slicing aided hyper inference
Abonia Sojasingarayar, Ashish Patel
3D Object Detection Human Attention 3D Detection Monocular 3D Object Detection 2 Dimensional Object Detection 3D Understanding Slice by Slice Stage Approach

December 10, 2022

ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding
Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese
Point Cloud Zero Shot Human Language Unified Representation 3D Object Classification 3D Understanding 3D Modality

September 24, 2022

Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline
Lichen Zhao, Daigang Cai, Jing Zhang, Lu Sheng, Dong Xu, Rui Zheng, Yinjie Zhao, Lipeng Wang, Xibo Fan
New Benchmark Visual Question Answering Strong Baseline 3D Understanding

November 30, 2021

Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding
Abdullah Hamdi, Silvio Giancola, Bernard Ghanem
Point Cloud Representation 3D Shape Representation 3D Understanding View Projection
