<node id="690203">
  <nid>690203</nid>
  <type>event</type>
  <uid>
    <user id="27707"><![CDATA[27707]]></user>
  </uid>
  <created>1778261228</created>
  <changed>1778261256</changed>
  <title><![CDATA[PhD Proposal by Jitesh Jain]]></title>
  <body><![CDATA[<p><strong>Title:</strong>&nbsp;Toward Multimodal Intelligence: Perception, Memory &amp; Any-Horizon Reasoning</p><p>&nbsp;</p><p>Jitesh Jain</p><p>Ph.D. Student in Computer Science</p><p>School of Interactive Computing</p><p>Georgia Institute of Technology</p><p><a href="https://praeclarumjj3.github.io/">https://praeclarumjj3.github.io/</a></p><p>&nbsp;</p><p><strong>Date:</strong>&nbsp;May 22, 2026, 12:00 - 2:00 PM EDT</p><p><strong>Location:</strong>&nbsp;Coda 1215</p><p>Zoom: <a href="https://gatech.zoom.us/j/9814414092?pwd=WnpkTjNhRHhYQlNzZGxTTW9SWmtJdz09">https://gatech.zoom.us/j/9814414092?pwd=WnpkTjNhRHhYQlNzZGxTTW9SWmtJdz09</a></p><p>&nbsp;</p><p><strong>Committee:</strong></p><p>Dr. Humphrey Shi (Advisor) - School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Zsolt Kira - School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Kartik Goyal - School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Judy Hoffman - Donald Bren School of Information and Computer Sciences, University of California, Irvine</p><p>Dr. Jianwei Yang - Member of Technical Staff, xAI</p><p>&nbsp;</p><p><strong>Abstract:&nbsp;</strong>Multimodal large language models have made impressive strides in language understanding and reasoning, yet they struggle with abilities that come naturally to humans: perceiving objects in cluttered scenes, remembering context across long interactions, and reasoning adaptively over extended time horizons. In this thesis, we argue that closing this gap requires integrating three capabilities that remain weak in current systems: visual perception, multimodal memory, and any-horizon reasoning.</p><p>&nbsp;</p><p>We begin by identifying that vision-language models fail at basic object-level perception, and we show that incorporating structured segmentation and depth signals as visual inputs significantly improves performance. Second, we improve spatial reasoning more fundamentally by distilling expert visual knowledge into the model's internal representations during pre-training, adding no cost at inference. Third, we build a multimodal agent with a graph-structured cognitive memory that enables efficient retrieval of multimodal context across long conversations. Finally, we propose an adaptive agent system for reasoning over long videos, addressing the challenges of scalable data collection, system design, and training recipes for open-ended video understanding.</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Toward Multimodal Intelligence: Perception, Memory & Any-Horizon Reasoning]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>Toward Multimodal Intelligence: Perception, Memory &amp; Any-Horizon Reasoning</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2026-05-22T12:00:00-04:00]]></value>
      <value2><![CDATA[2026-05-22T14:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Coda 1215]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>221981</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[Graduate Studies]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>102851</tid>
        <value><![CDATA[Phd proposal]]></value>
      </item>
      </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
