ABSTRACT: In this talk I will show how we can design modular agents for visual navigation that can follow tasks specified by natural language instructions, explore efficiently, plan over long horizons, and build and utilize 3D semantic maps, while generalizing across domains and tasks. Specifically, I will first introduce a novel framework called Self-supervised Embodied Active Learning (SEAL), which builds and utilizes 3D semantic maps to learn both action and perception in a self-supervised manner. I will show that the SEAL framework can be used to close the action-perception loop: it improves the object detection and instance segmentation performance of a pretrained perception model by moving around in training environments, and the improved perception model can in turn be used to improve performance on object goal navigation tasks. I will next introduce a novel embodied instruction following method that uses structured representations to build a semantic map of the scene and perform exploration with a semantic search policy.
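
As a rough illustration of the action-perception loop described above, here is a minimal Python sketch of a SEAL-style training cycle. All names (explore_and_build_map, render_pseudo_labels, fine_tune, etc.) are hypothetical placeholders rather than the actual SEAL code; the sketch only conveys the structure: explore to build a 3D semantic map, project the map back into first-person frames as pseudo-labels, fine-tune the perception model, and reuse the improved model for navigation.

```python
# Hypothetical sketch of a SEAL-style action-perception loop (not the actual SEAL code).
# The environment, exploration policy, and perception model are assumed to be supplied
# by the caller; in practice they would be a simulator, a learned semantic exploration
# policy, and a detector/segmenter such as Mask R-CNN.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    rgb: object    # first-person RGB observation
    pose: object   # camera pose used to project map labels back into the frame


@dataclass
class SemanticMap:
    frames: List[Frame]  # frames aggregated into the 3D semantic map during exploration


def explore_and_build_map(env, perception, policy, steps: int) -> SemanticMap:
    """Move around the environment (action) and fuse the perception model's
    per-frame predictions into a 3D semantic map."""
    frames = []
    for _ in range(steps):
        obs, pose = env.step(policy(env))  # active exploration step
        _ = perception(obs)                # per-frame predictions fused into the map
        frames.append(Frame(rgb=obs, pose=pose))
    return SemanticMap(frames=frames)


def render_pseudo_labels(smap: SemanticMap) -> List[Tuple[object, object]]:
    """Project the 3D semantic map back into each stored first-person view to obtain
    self-supervised labels (no human annotation)."""
    return [(f.rgb, f.pose) for f in smap.frames]  # placeholder pseudo-label pairs


def seal_cycle(env, perception, policy, steps: int = 500):
    """One SEAL-style cycle: explore -> build map -> self-label -> fine-tune perception.
    The improved perception model can then be plugged back into object goal navigation."""
    smap = explore_and_build_map(env, perception, policy, steps)
    pseudo_labels = render_pseudo_labels(smap)
    perception.fine_tune(pseudo_labels)  # perception improves from the agent's own actions
    return perception
```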