
PhD Defense by Meera Hahn

Title: Language Guided Localization and Navigation
Date: Friday, July 8, 2022
Time: 4-6pm (ET)
Location (virtual): https://gatech.zoom.us/j/92706895425?pwd=VVI0Y2lqRnVmYUFLbEIxVXNMTFpPQT09

Meera Hahn
School of Interactive Computing
College of Computing
Georgia Institute of Technology

Committee:
Dr. James M. Rehg (advisor), School of Interactive Computing, Georgia Institute of Technology
Dr. Dhruv Batra, School of Interactive Computing, Georgia Institute of Technology
Dr. Diyi Yang, School of Interactive Computing, Georgia Institute of Technology
Dr. Abhinav Gupta, The Robotics Institute, Carnegie Mellon University
Dr. Peter Anderson, Google

Abstract:
Embodied tasks that require active perception are key to improving language grounding models and creating holistic social agents. In this dissertation, we explore four multi-modal embodied perception tasks that require localization or navigation of an agent in an unknown temporal or 3D space with limited information about the environment. We first explore how an agent can be guided by language to navigate a temporal space using reinforcement learning, in a manner analogous to navigating a 3D space. Next, we explore how to teach an agent to navigate using only self-supervised learning from passive data. In this task, we remove the complexity of language and explore a navigation strategy based on topological maps and graph networks. We then present the Where Are You? (WAY) dataset, which contains over 6k dialogs of two humans performing a localization task. On top of this dataset, we design three tasks that push the envelope of current visual language-grounding tasks by introducing a multi-agent setup in which agents are required to use active perception to communicate, navigate, and localize. We specifically focus on modeling one of these tasks, Localization from Embodied Dialog (LED). The LED task involves taking a natural language dialog of two agents -- an observer and a locator -- and predicting the location of the observer agent. We find that a topological graph map of the environments is a successful representation for modeling the complex relational structure of the dialog and observer locations. We validate our approach against several state-of-the-art multi-modal baselines and show that a multi-modal transformer with large-scale pretraining outperforms all other models. We additionally introduce a novel analysis pipeline on this model for the LED and Vision Language Navigation (VLN) tasks to diagnose and reveal limitations and failure modes of these types of models.
