Visual Grounding
Visual grounding is the task of linking natural language descriptions to the corresponding regions of an image or 3D scene. Current research focuses on improving the accuracy and efficiency of grounding models, often using transformer-based architectures and leveraging multimodal large language models (MLLMs) for cross-modal feature fusion and reasoning. The field is central to embodied AI, enabling robots and other agents to understand and act on natural language instructions, and underpins applications such as robotic manipulation, visual question answering, and medical image analysis.
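To make the task definition concrete, the sketch below shows the minimal input/output contract of a region-scoring grounding model: candidate region features are fused with a text-query embedding, each region is scored, and the highest-scoring region's bounding box is returned. The module, feature dimensions, and random tensors are illustrative placeholders, not the architecture of any paper listed here.

# Minimal sketch of a 2D visual grounding interface. All names, shapes, and
# the random features are assumptions for illustration only.
import torch
import torch.nn as nn


class GroundingHead(nn.Module):
    """Scores candidate regions against a text query via simple additive feature fusion."""

    def __init__(self, region_dim: int = 256, text_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, region_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, region_dim); text_feat: (text_dim,)
        fused = torch.tanh(self.region_proj(region_feats) + self.text_proj(text_feat))
        return self.score(fused).squeeze(-1)  # (num_regions,) query-region matching scores


# Toy inputs: five candidate boxes (x1, y1, x2, y2) with random region features,
# and a random embedding standing in for an encoded query phrase.
boxes = torch.tensor([[10, 10, 50, 60], [30, 20, 90, 80], [5, 5, 20, 20],
                      [60, 40, 120, 100], [0, 0, 200, 150]], dtype=torch.float)
region_feats = torch.randn(5, 256)
text_feat = torch.randn(256)

head = GroundingHead()
scores = head(region_feats, text_feat)
best = scores.argmax().item()
print(f"grounded box for the query: {boxes[best].tolist()} (score {scores[best]:.3f})")

In practice, the region features would come from an image or point-cloud backbone and the query embedding from a language encoder; many recent systems instead fuse the two modalities inside a transformer or an MLLM rather than with the single additive layer used here.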
Papers
Empathic Grounding: Explorations using Multimodal Interaction and Large Language Models with Conversational Agents
Mehdi Arjmand, Farnaz Nouraei, Ian Steenstra, Timothy Bickmore
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
Ruiyuan Lyu, Tai Wang, Jingli Lin, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, Jiangmiao Pang
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding
Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen
F-LMM: Grounding Frozen Large Multimodal Models
Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions
Daizong Liu, Yang Liu, Wencan Huang, Wei Hu
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman