Ph.D. Defense by Arjun Majumdar
Title: Large-Scale Offline Pre-Training Bootstraps Embodied Intelligence
Arjun Majumdar
Ph.D. Student in Computer Science
School of Interactive Computing
Georgia Institute of Technology
Date: July 11th, 2024
Time: 12:00 - 1:30 pm ET / 9:00 - 10:30 am PT
Location: Zoom
Committee:
Dr. Dhruv Batra (Advisor) -- School of Interactive Computing, Georgia Institute of Technology
Dr. Zsolt Kira -- School of Interactive Computing, Georgia Institute of Technology
Dr. James Hays -- School of Interactive Computing, Georgia Institute of Technology
Dr. Jitendra Malik -- University of California Berkeley
Dr. Vincent Vanhoucke -- Google DeepMind
Dr. Vladlen Koltun -- Apple
Abstract:
A central goal in Artificial Intelligence is building embodied agents (such as mobile robots) that are generalists -- capable of assisting with a wide variety of tasks (specified in natural language) in any environment or setting. Such agents must understand a vast diversity of concepts in the visual world and be able to ground (or associate) this understanding with language so that users can describe tasks and goals. How can we develop agents with such an extensive and functional understanding of the world?
In this thesis, we will argue that offline pre-training of foundation models on web-scale data enables embodied intelligence. In Part 1, we present VC-1, a visual foundation model pre-trained (primarily) on video data collected from an egocentric perspective. Using CortexBench, an embodied AI (EAI) benchmark we curate from a diverse collection of existing EAI tasks (requiring locomotion, navigation, and dexterous and mobile manipulation of objects), we systematically demonstrate that such a model benefits substantially from increased pre-training dataset diversity.

In Part 2, we first demonstrate that visual grounding learned from internet data (i.e., image-caption pairs from the web) can be transferred to an instruction-following visual navigation agent (VLN-BERT). Then, we present ZSON, a highly scalable approach for learning to visually navigate to objects specified in open-vocabulary, natural language instructions such as “find the kitchen sink.” The key idea is to leverage a pre-trained vision-language embedding space (from CLIP) to decouple learning to represent semantic goals (such as “a kitchen sink”) from learning to navigate to semantic goals.

In Part 3, we present a modern formulation of the Embodied Question Answering (EQA) task, which requires understanding a 3D environment well enough to answer questions about it in natural language. We introduce a new benchmark (OpenEQA) and study a modular agent that leverages pre-trained components such as vision-language models (VLMs) to address the EQA task.
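To make the decoupling idea in Part 2 concrete, the following is a minimal sketch (not the thesis code) of how goals can be mapped into CLIP's shared embedding space: training-time goals can be embedded from images while test-time goals come from open-vocabulary text, and the same navigation policy consumes either. The Hugging Face transformers CLIP checkpoint and the nav_policy network named in the final comment are illustrative assumptions, not components specified in the abstract.

    # Minimal sketch of CLIP-based goal embeddings for navigation (illustrative only).
    import torch
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def embed_image_goal(goal_image):
        # Training-time goals: embed a picture of the target with CLIP's image encoder.
        inputs = proc(images=goal_image, return_tensors="pt")
        z = clip.get_image_features(**inputs)
        return torch.nn.functional.normalize(z, dim=-1)

    @torch.no_grad()
    def embed_text_goal(instruction: str):
        # Test-time goals: embed an open-vocabulary instruction, e.g. "a kitchen sink".
        inputs = proc(text=[instruction], return_tensors="pt", padding=True)
        z = clip.get_text_features(**inputs)
        return torch.nn.functional.normalize(z, dim=-1)

    # Because both encoders map into the same embedding space, a policy trained on
    # image goals can be handed text goals at evaluation time without retraining.
    # `nav_policy` is a hypothetical goal-conditioned policy network:
    # action = nav_policy(rgb_observation, embed_text_goal("find the kitchen sink"))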