Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [2209.13156]