Visual Grounding
Visual grounding is the task of localizing the region of an image or 3D scene that a natural language description refers to. Current research focuses on improving the accuracy and efficiency of visual grounding models, typically with transformer-based architectures that fuse visual and textual features, and increasingly by leveraging multimodal large language models (MLLMs) for richer cross-modal reasoning. The task is central to embodied AI, since it lets robots and other agents connect language to the physical world, and it underpins applications such as robotic manipulation, visual question answering, and medical image analysis.
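To make the transformer-based fusion pattern concrete, below is a minimal PyTorch sketch of a grounding head in the common two-stage style: a detector proposes candidate regions, each region feature cross-attends to the text-query tokens, and a linear head scores how well each region matches the description. All module names, dimensions, and the overall design here are illustrative assumptions, not the architecture of any specific published model.

```python
import torch
import torch.nn as nn

class ToyGroundingHead(nn.Module):
    """Illustrative cross-attention fusion head for visual grounding.

    Scores R candidate image regions against a tokenized text query.
    Purely a sketch: real systems use full encoder stacks, pretrained
    backbones, and box-regression heads on top of this pattern.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Each region feature attends over the query's token embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score = nn.Linear(dim, 1)  # per-region grounding logit

    def forward(self, region_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, R, dim) visual features for R candidate regions
        # text_feats:   (B, T, dim) token embeddings of the language query
        fused, _ = self.cross_attn(query=region_feats, key=text_feats, value=text_feats)
        fused = self.norm(region_feats + fused)   # residual + layer norm
        return self.score(fused).squeeze(-1)      # (B, R) match logits

if __name__ == "__main__":
    head = ToyGroundingHead()
    regions = torch.randn(1, 10, 256)  # e.g. 10 detector proposals
    query = torch.randn(1, 6, 256)     # e.g. tokens of "the red mug on the left"
    logits = head(regions, query)
    print("best-matching region index:", logits.argmax(dim=-1).item())
```

In a trained system the logits would be supervised against the annotated referent region (e.g. with a cross-entropy or contrastive loss), and one-stage variants instead regress the box directly from the fused features rather than ranking detector proposals.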