<node id="683253">
  <nid>683253</nid>
  <type>event</type>
  <uid>
    <user id="28475"><![CDATA[28475]]></user>
  </uid>
  <created>1753311112</created>
  <changed>1753311207</changed>
  <title><![CDATA[Ph.D. Dissertation Defense - Apoorva Beedu]]></title>
  <body><![CDATA[<p><strong>Title:</strong> <em>Learning Vision and Language Cues for Video Understanding in Egocentric and Instructional Videos</em></p><p><strong>Committee:</strong></p><p>Dr.&nbsp;Irfan Essa, CoC, Chair, Advisor</p><p>Dr.&nbsp;Justin Romberg, ECE, Co-Advisor</p><p>Dr.&nbsp;Thomas Ploetz, CoC</p><p>Dr.&nbsp;Larry Heck, ECE</p><p>Dr.&nbsp;Judy Hoffman, IC</p><p>Dr.&nbsp;Wei Xu, CoC</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Learning Vision and Language Cues for Video Understanding in Egocentric and Instructional Videos]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>We perceive the world through a combination of senses, such as sound, smell, and vision, to learn from and interact with our surroundings. Among these, vision and hearing are the primary sources of information gathering, especially through reading and listening. Effectively utilizing and combining these senses is key to developing intelligent systems that can operate in and understand complex environments. A critical challenge hindering effective vision-language learning is understanding why and how to integrate language for improved video understanding.</p><p>In this dissertation, we leverage the language modality to learn effective video representations across a range of tasks, including action recognition, forecasting, and summarization. The key ideas developed in this thesis are (i) Vision-Language supervision for action understanding, and (ii) Leveraging language for video summarization.</p><p>In Vision-Language supervision for action understanding, we generate rich action descriptions and leverage information from multiple modalities to recognize and anticipate future actions in videos. We also investigate the extent to which language contributes to understanding actions in videos through effective cross-modal supervision between the vision and language modalities.</p><p>Finally, in Leveraging language for video summarization, we generate text outputs for every input modality and evaluate the performance of foundation models on the video summarization task. By using text as the primary mode of input, we evaluate how text representations perform on video summarization. Building on this, we propose a hierarchical framework that incorporates multi-granular language cues and evaluate its effectiveness for video summarization.</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2025-07-28T14:00:00-04:00]]></value>
      <value2><![CDATA[2025-07-28T16:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
  </field_extras>
  <field_audience>
    <item>
      <value><![CDATA[Public]]></value>
    </item>
  </field_audience>
  <field_media>
  </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Room C1215 CODA (Midtown)]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
      <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
    <item>
      <url>https://gatech.zoom.us/j/3287180871?omn=93053535981</url>
      <link_title><![CDATA[Zoom link]]></link_title>
    </item>
  </links_related>
  <files>
  </files>
  <og_groups>
    <item>434381</item>
  </og_groups>
  <og_groups_both>
    <item><![CDATA[ECE Ph.D. Dissertation Defenses]]></item>
  </og_groups_both>
  <field_categories>
    <item>
      <tid>1788</tid>
      <value><![CDATA[Other/Miscellaneous]]></value>
    </item>
  </field_categories>
  <field_keywords>
    <item>
      <tid>100811</tid>
      <value><![CDATA[PhD Defense]]></value>
    </item>
    <item>
      <tid>1808</tid>
      <value><![CDATA[graduate students]]></value>
    </item>
  </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
