Paper ID: 2404.09647

Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map

Taichi Sakaguchi, Akira Taniguchi, Yoshinobu Hagiwara, Lotfi El Hafi, Shoichi Hasegawa, Tadahiro Taniguchi

Robots that assist humans in their daily lives should be able to locate specific instances of objects in an environment that match a user's desired objects. This task is known as instance-specific image goal navigation (InstanceImageNav), which requires a model that can distinguish different instances of an object within the same class. A significant challenge in robotics is that when a robot observes the same object from various 3D viewpoints, its appearance may differ significantly, making it difficult to recognize and locate accurately. In this paper, we introduce a method called SimView, which leverages multi-view images based on a 3D semantic map of an environment and self-supervised learning using SimSiam to train an instance-identification model on-site. The effectiveness of our approach was validated using a photorealistic simulator, Habitat Matterport 3D, created by scanning actual home environments. Our results demonstrate a 1.7-fold improvement in task accuracy compared with contrastive language-image pre-training (CLIP), a pre-trained multimodal contrastive learning method for object searching. This improvement highlights the benefits of our proposed fine-tuning method in enhancing the performance of assistive robots in InstanceImageNav tasks. The project website is this https URL.

Submitted: Apr 15, 2024