PhD Defense by Meera Hahn


Title: Language Guided Localization and Navigation
Date: Friday, July 8th, 2022
Time: 4-6pm (ET)
Location (virtual): https://gatech.zoom.us/j/92706895425?pwd=VVI0Y2lqRnVmYUFLbEIxVXNMTFpPQT09

Meera Hahn
School of Interactive Computing
College of Computing
Georgia Institute of Technology

Committee:
Dr. James M. Rehg (advisor), School of Interactive Computing, Georgia Institute of Technology
Dr. Dhruv Batra, School of Interactive Computing, Georgia Institute of Technology
Dr. Diyi Yang, School of Interactive Computing, Georgia Institute of Technology
Dr. Abhinav Gupta, The Robotics Institute, Carnegie Mellon University
Dr. Peter Anderson, Google

Embodied tasks that require active perception are key to improving language grounding models and creating holistic social agents. In this dissertation, we explore four multi-modal embodied perception tasks that require an agent to localize or navigate in an unknown temporal or 3D space with limited information about the environment. We first explore how an agent can be guided by language to navigate a temporal space using reinforcement learning, in a way similar to navigating a 3D space. Next, we explore how to teach an agent to navigate using only self-supervised learning from passive data. In this task, we remove the complexity of language and explore a navigation strategy based on a topological map and a graph network. We then present the Where Are You? (WAY) dataset, which contains over 6k dialogs of two humans performing a localization task. On top of this dataset, we design three tasks that push the envelope of current visual language-grounding tasks by introducing a multi-agent setup in which agents are required to use active perception to communicate, navigate, and localize. We focus specifically on modeling one of these tasks, Localization from Embodied Dialog (LED). The LED task involves taking a natural language dialog between two agents, an observer and a locator, and predicting the location of the observer agent. We find that a topological graph map of the environments is a successful representation for modeling the complex relational structure of the dialog and observer locations. We validate our approach against several state-of-the-art multi-modal baselines and show that a multi-modal transformer with large-scale pretraining outperforms all other models. We additionally introduce a novel analysis pipeline, applied to this model on the LED and Vision Language Navigation (VLN) tasks, to diagnose and reveal the limitations and failure modes of these types of models.



  • Created By: Tatianna Richardson
  • Modified By: Tatianna Richardson

