PhD Proposal by Ram Ramrakhya

Title: Internet-Scale Pretraining Enables Bootstrapping Skill-Specific Supervision for Training Multi-Modal Agents

Date: Wednesday, December 3, 2025

Time: 3:00-5:00 PM

Zoom: https://gatech.zoom.us/j/8098069992

Ram Ramrakhya

Ph.D. Student

School of Interactive Computing

Georgia Institute of Technology

Committee members

Dr. Zsolt Kira (advisor): School of Interactive Computing, Georgia Institute of Technology

Dr. Dhruv Batra (advisor): School of Interactive Computing, Georgia Institute of Technology

Dr. James Hays: School of Interactive Computing, Georgia Institute of Technology

Dr. Larry Heck: School of Interactive Computing and ECE, Georgia Institute of Technology

Dr. Alex Toshev: Research Scientist and Manager at Apple MLR

Abstract

In this thesis, we explore how foundation models pretrained on internet-scale data, which can follow instructions, reason, and edit data, enable bootstrapping skill-specific supervision for training multi-modal agents, allowing novel skills to be distilled without human-labelled data. First, we show how vision-language models can convert unlabelled web images into labelled data that teaches embodied agents spatial and semantic common-sense reasoning for object placement in indoor environments via supervised learning. Next, we demonstrate how large language models (LLMs) can synthesize reward functions, enabling reinforcement learning (RL) to distill skills that are hard to evaluate programmatically. Specifically, we show how to teach embodied agents to communicate in natural language and perform deductive reasoning to solve under-specified, ambiguous tasks using RL with synthetic rewards. Finally, we show that LLMs equipped with tools can interact with dynamic digital environments, which allows us to autonomously generate diverse tasks through environment self-play. These tasks, paired with synthesized demonstrations and generative verifiers, enable large-scale supervised fine-tuning and reinforcement learning that post-train LLMs into capable GUI-use agents. Together, these works illustrate the effectiveness of foundation models as supervisors, transforming raw data and pretrained knowledge into targeted learning signals for training capable multi-modal agents.
