Visual Grounding
Visual grounding is the task of connecting natural language descriptions to corresponding regions within an image or 3D scene. Current research focuses on improving the accuracy and efficiency of visual grounding models, often employing transformer-based architectures and leveraging multimodal large language models (MLLMs) for enhanced feature fusion and reasoning. The field is central to embodied AI, enabling robots and other agents to understand and interact with the world through natural language, and it has significant implications for applications such as robotic manipulation, visual question answering, and medical image analysis.
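To make the task concrete, the following is a minimal sketch of a generic transformer-based grounding model: pre-extracted image region features and text token embeddings are fused by a transformer encoder, and a learnable query token is decoded into a bounding box for the described region. The architecture, dimensions, and layer counts are illustrative assumptions, not the method of any paper listed below.

```python
import torch
import torch.nn as nn


class VisualGroundingSketch(nn.Module):
    """Illustrative transformer-based visual grounding model (not a specific paper's method).

    Fuses image region features and text token embeddings with a transformer
    encoder, then regresses one bounding box (cx, cy, w, h) for the phrase.
    """

    def __init__(self, dim: int = 256, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        # Learnable [REG] token whose fused embedding is decoded into a box.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Modality embeddings distinguish visual tokens from textual tokens.
        self.modality_embed = nn.Embedding(2, dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Box head maps the fused [REG] embedding to normalized coordinates in [0, 1].
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4), nn.Sigmoid()
        )

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, Nv, dim) region/patch features; text_feats: (B, Nt, dim) token embeddings.
        b = visual_feats.size(0)
        visual = visual_feats + self.modality_embed.weight[0]
        text = text_feats + self.modality_embed.weight[1]
        reg = self.reg_token.expand(b, -1, -1)
        # Joint self-attention over [REG] + visual + text tokens performs cross-modal fusion.
        fused = self.fusion(torch.cat([reg, visual, text], dim=1))
        # Decode the [REG] token into a (cx, cy, w, h) box.
        return self.box_head(fused[:, 0])


if __name__ == "__main__":
    model = VisualGroundingSketch()
    image_regions = torch.randn(2, 49, 256)   # e.g. a 7x7 grid of patch features
    phrase_tokens = torch.randn(2, 12, 256)   # embedded query phrase
    boxes = model(image_regions, phrase_tokens)
    print(boxes.shape)  # torch.Size([2, 4])
```

In practice the visual and text features would come from pretrained image and language encoders (or a single MLLM), and the box head would be trained with an L1 plus generalized-IoU loss against annotated region boxes.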
Papers
Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding
Jinlong He, Pengfei Li, Gang Liu, Shenjun Zhong
Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding
Minghong Xie, Mengzhao Wang, Huafeng Li, Yafei Zhang, Dapeng Tao, Zhengtao Yu
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
Lingxiao Luo, Bingda Tang, Xuanzhong Chen, Rong Han, Ting Chen
Context-Infused Visual Grounding for Art
Selina Khan, Nanne van Noord
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang