<node id="688940">
  <nid>688940</nid>
  <type>event</type>
  <uid>
    <user id="27707"><![CDATA[27707]]></user>
  </uid>
  <created>1773429050</created>
  <changed>1773663586</changed>
  <title><![CDATA[PhD Defense by Ram Ramrakhya]]></title>
  <body><![CDATA[<p><strong>Title:</strong>&nbsp;Internet-Scale Pretraining Enables Bootstrapping Skill-Specific Supervision for Training Multi-Modal Agents</p><p><strong>Date:</strong>&nbsp;Tuesday, 17th March 2026</p><p><strong>Time:</strong>&nbsp;3:30-5:00 PM ET</p><p><strong>Zoom:</strong>&nbsp;<a href="https://gatech.zoom.us/j/8098069992?pwd=YW1SenhpWkgrdmRMaDk4STBNTzZDUT09" target="_blank" title="https://gatech.zoom.us/j/8098069992?pwd=YW1SenhpWkgrdmRMaDk4STBNTzZDUT09">https://gatech.zoom.us/j/8098069992</a></p><p>&nbsp;</p><p><strong>Ram Ramrakhya</strong></p><p>Ph.D. Student</p><p>School of Interactive Computing</p><p>Georgia Institute of Technology</p><p>&nbsp;</p><p><strong>Committee members</strong></p><p>Dr. Zsolt Kira (advisor): School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Dhruv Batra (advisor): School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. James Hays: School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Larry Heck: School of Interactive Computing and ECE, Georgia Institute of Technology</p><p>Dr. Alex Toshev: Research Scientist and Manager, Apple MLR</p><p>&nbsp;</p><p><strong>Abstract</strong></p><p>In this thesis, we explore how foundation models pretrained on large-scale internet data, which can follow instructions, reason, and edit data, enable bootstrapping of skill-specific supervision for training multi-modal agents, allowing novel skills to be distilled without human-labelled data. First, we show how vision–language models can convert unlabelled web images into labelled data that teaches embodied agents spatial and semantic common-sense reasoning for object placement in indoor environments via supervised learning. Next, we demonstrate how large language models can synthesize reward functions, enabling reinforcement learning (RL) to distill skills that are hard to evaluate programmatically. Specifically, we show how to teach embodied agents to communicate in natural language and perform deductive reasoning to solve under-specified and ambiguous tasks using RL with synthetic rewards. Finally, we show that LLMs can be equipped with tools to interact with dynamic digital environments, which allows us to autonomously generate diverse tasks through environment self-play. These tasks, paired with synthesized demonstrations and generative verifiers, enable large-scale supervised finetuning and reinforcement learning for post-training LLMs as capable GUI-use agents. Together, these works illustrate the effectiveness of foundation models as supervisors, transforming raw data and pretrained knowledge into targeted learning signals for training capable multi-modal agents.</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Internet-Scale Pretraining Enables Bootstrapping Skill-Specific Supervision for Training Multi-Modal Agents]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>Internet-Scale Pretraining Enables Bootstrapping Skill-Specific Supervision for Training Multi-Modal Agents</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2026-03-17T15:30:00-04:00]]></value>
      <value2><![CDATA[2026-03-17T17:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[ZOOM]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>221981</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[Graduate Studies]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>100811</tid>
        <value><![CDATA[PhD Defense]]></value>
      </item>
      </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
