PhD Proposal by Arjun Majumdar
Title: Large-scale Offline Pre-training to Enable Embodied Intelligence
Arjun Majumdar
Ph.D. Student in Computer Science
School of Interactive Computing
Georgia Institute of Technology
Date: November 29th, 2023
Time: 3:00pm - 5:00pm ET / 12:00pm - 2:00pm PT
Location: zoom link; Coda C1215 Midtown
Committee:
Dr. Dhruv Batra (Advisor) -- School of Interactive Computing, Georgia Institute of Technology
Dr. Zsolt Kira -- School of Interactive Computing, Georgia Institute of Technology
Dr. James Hays -- School of Interactive Computing, Georgia Institute of Technology
Dr. Jitendra Malik -- University of California Berkeley
Dr. Vincent Vanhoucke -- Google DeepMind
Dr. Vladlen Koltun -- Apple
Abstract:
A central goal in Artificial Intelligence is building embodied agents (such as mobile robots) that are generalists -- capable of assisting with a wide variety of tasks (specified in natural language) in any environment or setting. Such agents must understand a vast diversity of concepts in the visual world and be able to ground (or associate) this understanding with language to allow users to describe tasks and goals. How can we develop agents with such an extensive and functional understanding of the world?
In this thesis, we will argue that offline pre-training of foundation models on web-scale data enables embodied intelligence. In Part 1, we present VC-1, a visual foundation model pre-trained (primarily) on video data collected from an egocentric perspective. Using CortexBench, an embodied AI (EAI) benchmark we curate from a diverse collection of existing EAI tasks (requiring locomotion, navigation, and dexterous and mobile manipulation of objects), we systematically demonstrate that such a model benefits substantially from increased pre-training dataset diversity. In Part 2, we first demonstrate that visual grounding learned from internet data (i.e., image-caption pairs from the web) can be transferred to an instruction-following visual navigation agent (VLN-BERT). Then, we present ZSON, a highly scalable approach for learning to visually navigate to objects specified in open-vocabulary, natural language instructions such as “find the kitchen sink.” The key idea is to leverage a pre-trained visiolinguistic embedding space (from CLIP) to decouple learning to represent semantic goals (such as “a kitchen sink”) from learning to navigate to semantic goals. Finally, in proposed work, we will study combining vision-and-language models (VLMs) with large language models (LLMs) for the task of embodied question answering (EQA), which requires an agent to answer open-ended questions about real-world environments.
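To make the decoupling idea behind ZSON concrete, the sketch below is a minimal, illustrative example and not the ZSON implementation: it assumes OpenAI's clip package with a ViT-B/32 checkpoint, a local image file kitchen_sink.jpg, and a hypothetical goal-conditioned navigation_policy. It shows only how an image goal and an open-vocabulary language goal can be embedded into the same CLIP space, so that a policy trained with one form of goal can, in principle, be conditioned on the other.

# Minimal sketch (not the ZSON implementation): embed image and language
# goals into CLIP's shared vision-language space. Assumes the `clip`
# package (github.com/openai/CLIP); `kitchen_sink.jpg` and
# `navigation_policy` are hypothetical placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Goal specified as an image (goals of this form can be generated at scale
# for training, without human annotations).
goal_image = preprocess(Image.open("kitchen_sink.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_goal_embedding = model.encode_image(goal_image)

# Goal specified in open-vocabulary natural language (how a user might
# describe the goal at test time).
tokens = clip.tokenize(["a kitchen sink"]).to(device)
with torch.no_grad():
    text_goal_embedding = model.encode_text(tokens)

# Because both embeddings live in the same space, either one can condition
# the same (hypothetical) goal-conditioned navigation policy:
#   action = navigation_policy(observation, goal_embedding)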