Visual Grounding
Visual grounding is the task of localizing the region of an image or 3D scene that a natural language description refers to. Current research focuses on improving the accuracy and efficiency of visual grounding models, typically with transformer-based architectures that fuse visual and textual features, and increasingly by leveraging multimodal large language models (MLLMs) for richer cross-modal reasoning. The task is central to embodied AI, since it lets robots and other agents connect language to the physical world, and it underpins applications such as robotic manipulation, visual question answering, and medical image analysis.
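To make the transformer-based fusion pattern concrete, below is a minimal PyTorch sketch of a grounding head in the common two-stage style: a detector proposes candidate regions, each region feature cross-attends to the text-query tokens, and a linear head scores how well each region matches the description. All module names, dimensions, and the overall design here are illustrative assumptions, not the architecture of any specific published model.

```python
import torch
import torch.nn as nn

class ToyGroundingHead(nn.Module):
    """Illustrative cross-attention fusion head for visual grounding.

    Scores R candidate image regions against a tokenized text query.
    Purely a sketch: real systems use full encoder stacks, pretrained
    backbones, and box-regression heads on top of this pattern.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Each region feature attends over the query's token embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score = nn.Linear(dim, 1)  # per-region grounding logit

    def forward(self, region_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, R, dim) visual features for R candidate regions
        # text_feats:   (B, T, dim) token embeddings of the language query
        fused, _ = self.cross_attn(query=region_feats, key=text_feats, value=text_feats)
        fused = self.norm(region_feats + fused)   # residual + layer norm
        return self.score(fused).squeeze(-1)      # (B, R) match logits

if __name__ == "__main__":
    head = ToyGroundingHead()
    regions = torch.randn(1, 10, 256)  # e.g. 10 detector proposals
    query = torch.randn(1, 6, 256)     # e.g. tokens of "the red mug on the left"
    logits = head(regions, query)
    print("best-matching region index:", logits.argmax(dim=-1).item())
```

In a trained system the logits would be supervised against the annotated referent region (e.g. with a cross-entropy or contrastive loss), and one-stage variants instead regress the box directly from the fused features rather than ranking detector proposals.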