<node id="688940">
  <nid>688940</nid>
  <type>event</type>
  <uid>
    <user id="27707"><![CDATA[27707]]></user>
  </uid>
  <created>1773429050</created>
  <changed>1773663586</changed>
  <title><![CDATA[PhD Defense by Ram Ramrakhya]]></title>
  <body><![CDATA[<p><strong>Title:</strong>&nbsp;Internet-Scale Pretraining Enables Bootstrapping Skill-Specific Supervision for Training Multi-Modal Agents</p><p><strong>Date:</strong>&nbsp;Tuesday, 17th March 2026</p><p><strong>Time:</strong>&nbsp;3:30-5:00 PM ET</p><p><strong>Zoom:</strong>&nbsp;<a href="https://gatech.zoom.us/j/8098069992?pwd=YW1SenhpWkgrdmRMaDk4STBNTzZDUT09" target="_blank" title="https://gatech.zoom.us/j/8098069992?pwd=YW1SenhpWkgrdmRMaDk4STBNTzZDUT09">https://gatech.zoom.us/j/8098069992</a></p><p>&nbsp;</p><p><strong>Ram Ramrakhya</strong></p><p>Ph.D. Student</p><p>School of Interactive Computing</p><p>Georgia Institute of Technology</p><p>&nbsp;</p><p><strong>Committee members</strong></p><p>Dr. Zsolt Kira (advisor): School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Dhruv Batra (advisor): School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. James Hays: School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Larry Heck: School of Interactive Computing and ECE, Georgia Institute of Technology</p><p>Dr. Alex Toshev: Research Scientist and Manager, Apple MLR</p><p>&nbsp;</p><p><strong>Abstract</strong></p><p>In this thesis, we explore how foundation models pretrained on large-scale internet data, which can follow instructions, reason, and edit data, enable bootstrapping of skill-specific supervision for training multi-modal agents, allowing novel skills to be distilled without human-labelled data. First, we show how vision–language models can convert unlabelled web images into labelled data that teaches embodied agents spatial and semantic common-sense reasoning for object placement in indoor environments via supervised learning. Next, we demonstrate how large language models can synthesize reward functions, enabling reinforcement learning (RL) to distill skills that are hard to evaluate programmatically. Specifically, we show how to teach embodied agents to communicate in natural language and perform deductive reasoning to solve under-specified and ambiguous tasks using RL with synthetic rewards. Finally, we show that LLMs can be equipped with tools to interact with dynamic digital environments, which allows us to autonomously generate diverse tasks through environment self-play. These tasks, paired with synthesized demonstrations and generative verifiers, enable large-scale supervised finetuning and reinforcement learning for post-training LLMs as capable GUI-use agents. Together, these works illustrate the effectiveness of foundation models as supervisors, transforming raw data and pretrained knowledge into targeted learning signals for training capable multi-modal agents.</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Internet-Scale Pretraining Enables Bootstrapping Skill-Specific Supervision for Training Multi-Modal Agents]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>Internet-Scale Pretraining Enables Bootstrapping Skill-Specific Supervision for Training Multi-Modal Agents</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2026-03-17T15:30:00-04:00]]></value>
      <value2><![CDATA[2026-03-17T17:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[ZOOM]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>221981</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[Graduate Studies]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>100811</tid>
        <value><![CDATA[PhD Defense]]></value>
      </item>
      </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
