
Ph.D. Defense by Arjun Majumdar


Title: Large-Scale Offline Pre-Training Bootstraps Embodied Intelligence

  

Arjun Majumdar

Ph.D. Student in Computer Science

School of Interactive Computing

Georgia Institute of Technology

 

Date: July 11th, 2024

Time: 12:00 - 1:30 pm EDT / 9:00 - 10:30 am PDT

Location: Zoom

Committee:

Dr. Dhruv Batra (Advisor) -- School of Interactive Computing, Georgia Institute of Technology

Dr. Zsolt Kira -- School of Interactive Computing, Georgia Institute of Technology

Dr. James Hays -- School of Interactive Computing, Georgia Institute of Technology

Dr. Jitendra Malik -- University of California Berkeley

Dr. Vincent Vanhoucke -- Google DeepMind

Dr. Vladlen Koltun -- Apple

 

Abstract:

A central goal in Artificial Intelligence is building embodied agents (such as mobile robots) that are generalists -- capable of assisting with a wide variety of tasks (specified in natural language) in any environment or setting. Such agents must understand a vast diversity of concepts in the visual world and be able to ground (or associate) this understanding with language to allow users to describe tasks and goals. How can we develop agents with such an extensive and functional understanding of the world?

 

In this thesis, we will argue that offline pre-training of foundation models on web-scale data enables embodied intelligence. In Part 1, we present VC-1, a visual foundation model pre-trained (primarily) on video data collected from an egocentric perspective. We systematically demonstrate that such a model substantially benefits from increasing pre-training dataset diversity by introducing CortexBench, an embodied AI (EAI) benchmark curated from a diverse collection of existing EAI tasks (requiring locomotion, navigation, and dexterous and mobile manipulation of objects).

In Part 2, we first demonstrate that visual grounding learned from internet data (i.e., image-caption pairs from the web) can be transferred to an instruction-following visual navigation agent (VLN-BERT). Then, we present ZSON, a highly scalable approach for learning to visually navigate to objects specified in open-vocabulary, natural language instructions such as “find the kitchen sink.” The key idea is to leverage a pre-trained visiolinguistic embedding space (from CLIP) to decouple learning to represent semantic goals (such as “a kitchen sink”) from learning to navigate to semantic goals.

In Part 3, we present a modern formulation of the Embodied Question Answering (EQA) task, which requires understanding a 3D environment well enough to answer questions about it in natural language. We introduce a new benchmark (OpenEQA) and study a modular agent that leverages pre-trained components such as vision-language models (VLMs) to address the EQA task.
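
The decoupling idea behind ZSON (Part 2) can be pictured with a short, hedged sketch -- this is illustrative code, not the thesis implementation. It shows a frozen CLIP encoder mapping both an open-vocabulary text goal and an image goal into the same embedding space; the NavPolicy class is a purely hypothetical stand-in for a learned navigation policy that only ever consumes a goal embedding, so a policy trained on image goals could, in principle, be handed language goals at test time.

    # Minimal sketch, assuming OpenAI's "clip" package and PyTorch are installed.
    # NavPolicy and the file "goal.jpg" are hypothetical illustrations.
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)  # frozen CLIP encoder

    # 1) Goal representation comes "for free" from web-scale pre-training:
    #    text goals and image goals land in the same embedding space.
    with torch.no_grad():
        text_goal = model.encode_text(
            clip.tokenize(["a kitchen sink"]).to(device))                  # (1, 512)
        image_goal = model.encode_image(
            preprocess(Image.open("goal.jpg")).unsqueeze(0).to(device))    # (1, 512)

    # 2) Navigation is learned separately: the policy sees only a goal
    #    embedding, so training on image goals can transfer zero-shot
    #    to open-vocabulary language goals.
    class NavPolicy(torch.nn.Module):  # hypothetical policy head
        def __init__(self, obs_dim=512, goal_dim=512, num_actions=4):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(obs_dim + goal_dim, 256),
                torch.nn.ReLU(),
                torch.nn.Linear(256, num_actions),
            )

        def forward(self, obs_embedding, goal_embedding):
            return self.net(torch.cat([obs_embedding, goal_embedding], dim=-1))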
