Visual Grounding
Visual grounding is the task of localizing the region of an image or 3D scene that a natural language description refers to. Current research focuses on improving the accuracy and efficiency of grounding models, typically with transformer-based architectures, and increasingly leverages multimodal large language models (MLLMs) for richer cross-modal feature fusion and reasoning. The task is central to embodied AI, since it lets robots and other agents connect language to their surroundings, with applications in robotic manipulation, visual question answering, and medical image analysis.
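At its core, a grounding model scores a set of candidate image regions against an encoding of the query text and returns the best-matching box. The sketch below illustrates just that final scoring step with placeholder tensors standing in for encoder outputs; the dimensions, random features, and boxes are illustrative assumptions, not any particular model's interface.

```python
# Minimal, illustrative sketch of the core grounding step: score candidate
# image regions against a sentence embedding and pick the best box.
# The random "features" below are placeholders; a real model would produce
# them with a vision backbone and a text encoder.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_regions, dim = 8, 256

# Stand-ins for encoder outputs: one feature per candidate region (e.g. from
# a detector's proposals) and one pooled feature for the query sentence.
region_feats = torch.randn(num_regions, dim)   # (N, d) visual features
text_feat = torch.randn(dim)                   # (d,)  language feature
boxes = torch.rand(num_regions, 4)             # (N, 4) xyxy, normalized

# Cosine similarity between each region and the query, as in CLIP-style dual
# encoders. Transformer-based grounders fuse the two modalities much earlier
# with cross-attention, but the output is still a per-region score.
scores = F.cosine_similarity(region_feats, text_feat.unsqueeze(0), dim=-1)
probs = scores.softmax(dim=-1)                 # distribution over regions

best = probs.argmax().item()
print(f"grounded region {best}: box={boxes[best].tolist()}, "
      f"p={probs[best].item():.3f}")
```

In practice the region features and text feature come from jointly trained encoders, and models are trained so that the referred region's score is highest; the argmax-over-similarities structure shown here carries over.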