Paper ID: 2212.05923

Self-Supervised Object Goal Navigation with In-Situ Finetuning

So Yeon Min, Yao-Hung Hubert Tsai, Wei Ding, Ali Farhadi, Ruslan Salakhutdinov, Yonatan Bisk, Jian Zhang

A household robot should be able to navigate to target objects without requiring users to first annotate everything in their home. Most current approaches to object navigation do not test on real robots and rely solely on reconstructed scans of houses and their expensively labeled semantic 3D meshes. In this work, our goal is to build an agent that builds self-supervised models of the world via exploration, much as a child might; thus we (1) eschew the expense of labeled 3D meshes and (2) enable self-supervised in-situ finetuning in the real world. We identify a strong source of self-supervision (Location Consistency, LocCon) that can train all components of an ObjectNav agent using unannotated simulated houses. Our key insight is that embodied agents can leverage location consistency as a self-supervision signal: collecting images of the same location from different views/angles and applying contrastive learning. We show that our agent performs competitively in both the real world and simulation. Our results also indicate that supervised training with 3D mesh annotations causes models to learn simulation artifacts that do not transfer to the real world. In contrast, LocCon shows the most robust real-world transfer among the models we compare, and the real-world performance of all models can be further improved with self-supervised LocCon in-situ training.
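To make the location-consistency idea concrete, the following is a minimal sketch of a contrastive objective over same-location image pairs, assuming a PyTorch-style setup. The function name, batch construction, and temperature value are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def loccon_contrastive_loss(anchor_emb, positive_emb, temperature=0.1):
    """
    InfoNCE-style contrastive loss sketch for location consistency.

    anchor_emb, positive_emb: (B, D) embeddings of image pairs captured at
    the same location from different viewpoints/angles. Row i of anchor_emb
    is pulled toward row i of positive_emb and pushed away from all other
    rows in the batch, which are treated as images of different locations.
    """
    anchor = F.normalize(anchor_emb, dim=1)
    positive = F.normalize(positive_emb, dim=1)

    # Cosine similarity between every anchor and every candidate in the batch.
    logits = anchor @ positive.t() / temperature  # (B, B)

    # Diagonal entries correspond to the true same-location pairs.
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

# Usage sketch (encoder and view batches are hypothetical):
# emb_a = encoder(view_a_images)  # views collected by the exploring agent
# emb_b = encoder(view_b_images)  # second views of the same locations
# loss = loccon_contrastive_loss(emb_a, emb_b)
```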

Submitted: Dec 9, 2022