
PhD Proposal by Arjun Majumdar


Title: Large-scale Offline Pre-training to Enable Embodied Intelligence

 

Arjun Majumdar

Ph.D. Student in Computer Science

School of Interactive Computing 

Georgia Institute of Technology

 

Date: November 29th, 2023

Time: 3:00pm - 5:00pm ET / 12:00pm - 2:00pm PT

Location: Zoom link; Coda C1215, Midtown

Committee:

Dr. Dhruv Batra (Advisor) -- School of Interactive Computing, Georgia Institute of Technology

Dr. Zsolt Kira -- School of Interactive Computing, Georgia Institute of Technology

Dr. James Hays -- School of Interactive Computing, Georgia Institute of Technology

Dr. Jitendra Malik -- University of California, Berkeley

Dr. Vincent Vanhoucke -- Google DeepMind

Dr. Vladlen Koltun -- Apple

 

Abstract:

A central goal in Artificial Intelligence is building embodied agents (such as mobile robots) that are generalists -- capable of assisting with a wide variety of tasks (specified in natural language) in any environment or setting. Such agents must understand a vast diversity of concepts in the visual world and be able to ground (or associate) this understanding with language so that users can describe tasks and goals. How can we develop agents with such an extensive and functional understanding of the world?

 

In this thesis, we will argue that offline pre-training of foundation models on web-scale data enables embodied intelligence. In Part 1, we present VC-1, a visual foundation model pre-trained (primarily) on video data collected from an egocentric perspective. Using CortexBench, an embodied AI (EAI) benchmark curated from a diverse collection of existing EAI tasks (spanning locomotion, navigation, and dexterous and mobile manipulation of objects), we systematically demonstrate that such a model benefits substantially from increased pre-training dataset diversity. In Part 2, we first demonstrate that visual grounding learned from internet data (i.e., image-caption pairs from the web) can be transferred to an instruction-following visual navigation agent (VLN-BERT). Then, we present ZSON, a highly scalable approach for learning to visually navigate to objects specified in open-vocabulary, natural language instructions such as “find the kitchen sink.” The key idea is to leverage a pre-trained visiolinguistic embedding space (from CLIP) to decouple learning to represent semantic goals (such as “a kitchen sink”) from learning to navigate to semantic goals. Finally, in proposed work, we will study combining vision-and-language models (VLMs) with large language models (LLMs) for the task of embodied question answering (EQA), which requires an agent to answer open-ended questions about real-world environments.
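
The decoupling idea behind ZSON depends on image goals and language goals living in one shared embedding space. The sketch below is illustrative only (it is not the ZSON implementation): it embeds a language goal and a placeholder goal image with an off-the-shelf CLIP model and compares them. The Hugging Face checkpoint name and the random placeholder image are assumptions made for the example; the point is that a policy trained to reach image-goal embeddings can be handed a language-goal embedding at test time.

```python
# Minimal sketch (not the ZSON implementation): a shared CLIP embedding space
# lets an agent treat image-specified and language-specified goals interchangeably.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Language-specified goal, e.g. from the instruction "find the kitchen sink".
text_inputs = processor(text=["a kitchen sink"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_goal = model.get_text_features(**text_inputs)

# Image-specified goal (a random placeholder image stands in for a real goal photo).
goal_image = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
image_inputs = processor(images=goal_image, return_tensors="pt")
with torch.no_grad():
    image_goal = model.get_image_features(**image_inputs)

# Both goal embeddings live in the same space, so the same navigation policy can
# consume either one; similarity in this space is what makes the transfer possible.
similarity = torch.nn.functional.cosine_similarity(text_goal, image_goal)
print(f"cosine similarity between language goal and image goal: {similarity.item():.3f}")
```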
