Visual Grounding
Visual grounding is the task of linking natural language descriptions to the corresponding regions of an image or 3D scene. Current research focuses on improving the accuracy and efficiency of grounding models, often using transformer-based architectures and leveraging multimodal large language models (MLLMs) for cross-modal feature fusion and reasoning. The field is central to embodied AI, enabling robots and other agents to understand and act on natural language instructions, and underpins applications such as robotic manipulation, visual question answering, and medical image analysis.
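To make the task definition concrete, the sketch below shows the minimal input/output contract of a region-scoring grounding model: candidate region features are fused with a text-query embedding, each region is scored, and the highest-scoring region's bounding box is returned. The module, feature dimensions, and random tensors are illustrative placeholders, not the architecture of any paper listed here.

# Minimal sketch of a 2D visual grounding interface. All names, shapes, and
# the random features are assumptions for illustration only.
import torch
import torch.nn as nn


class GroundingHead(nn.Module):
    """Scores candidate regions against a text query via simple additive feature fusion."""

    def __init__(self, region_dim: int = 256, text_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, region_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, region_dim); text_feat: (text_dim,)
        fused = torch.tanh(self.region_proj(region_feats) + self.text_proj(text_feat))
        return self.score(fused).squeeze(-1)  # (num_regions,) query-region matching scores


# Toy inputs: five candidate boxes (x1, y1, x2, y2) with random region features,
# and a random embedding standing in for an encoded query phrase.
boxes = torch.tensor([[10, 10, 50, 60], [30, 20, 90, 80], [5, 5, 20, 20],
                      [60, 40, 120, 100], [0, 0, 200, 150]], dtype=torch.float)
region_feats = torch.randn(5, 256)
text_feat = torch.randn(256)

head = GroundingHead()
scores = head(region_feats, text_feat)
best = scores.argmax().item()
print(f"grounded box for the query: {boxes[best].tolist()} (score {scores[best]:.3f})")

In practice, the region features would come from an image or point-cloud backbone and the query embedding from a language encoder; many recent systems instead fuse the two modalities inside a transformer or an MLLM rather than with the single additive layer used here.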
Papers
Empathic Grounding: Explorations using Multimodal Interaction and Large Language Models with Conversational Agents
Mehdi Arjmand, Farnaz Nouraei, Ian Steenstra, Timothy Bickmore
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
Ruiyuan Lyu, Tai Wang, Jingli Lin, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, Jiangmiao Pang
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding
Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen
F-LMM: Grounding Frozen Large Multimodal Models
Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions
Daizong Liu, Yang Liu, Wencan Huang, Wei Hu
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman