Paper ID: 2409.15505 • Published Sep 23, 2024
Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs
There has been considerable interest in grounding natural language to physical
entities through visual context. While Vision Language Models (VLMs) can ground
linguistic instructions to visual sensory information, they struggle with
grounding non-visual attributes, such as the weight of an object. Our key insight
is that non-visual attribute detection can be effectively achieved by active
perception guided by visual reasoning. To this end, we present a
perception-action API that consists of VLMs and Large Language Models (LLMs) as
backbones, together with a set of robot control functions. When prompted with
this API and a natural language query, an LLM generates a program to actively
identify attributes given an input image. Offline testing on the Odd-One-Out
dataset demonstrates that our framework outperforms vanilla VLMs in detecting
attributes like relative object location, size, and weight. Online testing in
realistic household scenes on AI2-THOR and a real robot demonstration on a DJI
RoboMaster EP robot highlight the efficacy of our approach.
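To make the idea concrete, the sketch below illustrates the kind of perception-action API and LLM-generated program the abstract describes. It is a minimal, hypothetical Python sketch: the function names (detect_objects, push_object, find_heaviest) and the force-based weight comparison are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of a perception-action API in the spirit of the paper.
# Names and signatures are assumptions for illustration only.

from dataclasses import dataclass


@dataclass
class DetectedObject:
    name: str
    bbox: tuple  # (x, y, w, h) in image coordinates


def detect_objects(image) -> list[DetectedObject]:
    """Perception call: a VLM backbone grounds object names to image regions."""
    raise NotImplementedError  # backed by a VLM in the real system


def push_object(obj: DetectedObject) -> float:
    """Action call: push the object and return the measured effort."""
    raise NotImplementedError  # backed by the robot controller in the real system


# Example of the kind of program an LLM might generate for the query
# "Which object on the table is the heaviest?"
def find_heaviest(image) -> DetectedObject:
    objects = detect_objects(image)
    # Active perception: interact with each candidate and compare push efforts,
    # since weight is not recoverable from the image alone.
    efforts = {obj.name: push_object(obj) for obj in objects}
    heaviest_name = max(efforts, key=efforts.get)
    return next(o for o in objects if o.name == heaviest_name)
```

In this sketch, the LLM composes perception calls (VLM-grounded detection) with robot control calls (pushing) to infer a non-visual attribute, which is the pattern the abstract attributes to the proposed framework.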