Object Grounding
Object grounding links textual descriptions or commands to the corresponding objects or regions in images or videos, bridging language and vision. Current research focuses on improving grounding accuracy and robustness across modalities (audio, visual, 3D), using techniques such as contrastive learning, transformer networks, and graph-based reasoning. This work matters for applications such as robotics, autonomous driving, and multimodal AI, where it enables richer natural-language interaction between humans and machines. Ongoing efforts also address biases in existing models and evaluate how effectively grounding reduces hallucinations in large language models.
Papers
Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments
Danial Kamali, Parisa Kordjamshidi
Fully Automated Task Management for Generation, Execution, and Evaluation: A Framework for Fetch-and-Carry Tasks with Natural Language Instructions in Continuous Space
Motonari Kambara, Komei Sugiura