PhD Proposal by Ram Ramrakhya
Title: Internet-Scale Pretraining Enables Bootstrapping Skill-Specific Supervision for Training Multi-Modal Agents
Date: Wednesday, December 3, 2025
Time: 3:00-5:00 PM
Zoom: https://gatech.zoom.us/j/8098069992
Ram Ramrakhya
Ph.D. Student
School of Interactive Computing
Georgia Institute of Technology
Committee members
Dr. Zsolt Kira (advisor): School of Interactive Computing, Georgia Institute of Technology
Dr. Dhruv Batra (advisor): School of Interactive Computing, Georgia Institute of Technology
Dr. James Hays: School of Interactive Computing, Georgia Institute of Technology
Dr. Larry Heck: School of Interactive Computing and ECE, Georgia Institute of Technology
Dr. Alex Toshev: Research Scientist and Manager at Apple MLR
Abstract
In this thesis, we explore how foundation models pretrained on large-scale internet data, with their ability to follow instructions, reason, and edit data, enable bootstrapping skill-specific supervision for training multi-modal agents, allowing novel skills to be distilled without human-labelled data. First, we show how vision-language models can convert unlabelled web images into labelled data that teaches embodied agents spatial and semantic common-sense reasoning for object placement in indoor environments via supervised learning. Next, we demonstrate how large language models can synthesize reward functions, enabling reinforcement learning (RL) to distill skills that are hard to evaluate programmatically; specifically, we teach embodied agents to communicate in natural language and perform deductive reasoning to solve under-specified, ambiguous tasks using RL with synthetic rewards. Finally, we show that LLMs equipped with tools can interact with dynamic digital environments, allowing us to autonomously generate diverse tasks through environment self-play. These tasks, paired with synthesized demonstrations and generative verifiers, enable large-scale supervised finetuning and reinforcement learning for post-training LLMs into capable GUI-use agents. Together, these works illustrate the effectiveness of foundation models as supervisors, transforming raw data and pretrained knowledge into targeted learning signals for training capable multi-modal agents.