<node id="683253">
  <nid>683253</nid>
  <type>event</type>
  <uid>
    <user id="28475"><![CDATA[28475]]></user>
  </uid>
  <created>1753311112</created>
  <changed>1753311207</changed>
  <title><![CDATA[Ph.D. Dissertation Defense - Apoorva Beedu]]></title>
  <body><![CDATA[<p><strong>Title:</strong> <em>Learning Vision and Language Cues for Video Understanding in Egocentric and Instructional Videos</em></p><p><strong>Committee:</strong></p><p>Dr.&nbsp;Irfan Essa, CoC, Chair, Advisor</p><p>Dr.&nbsp;Justin Romberg, ECE, Co-Advisor</p><p>Dr.&nbsp;Thomas Ploetz, CoC</p><p>Dr.&nbsp;Larry Heck, ECE</p><p>Dr.&nbsp;Judy Hoffman, IC</p><p>Dr.&nbsp;Wei Xu, CoC</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Learning Vision and Language Cues for Video Understanding in Egocentric and Instructional Videos]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>We perceive the world through a combination of senses, such as sound, smell, and vision, to learn from and interact with our surroundings. Among these, vision and hearing are the primary sources of information gathering, especially through reading and listening. Effectively utilizing and combining these senses is key to developing intelligent systems that can operate in and understand complex environments. A critical challenge hindering effective vision-language learning is understanding why and how to integrate language for improved video understanding.</p><p>In this dissertation, we leverage the language modality to learn effective video representations across a range of tasks, including action recognition, forecasting, and summarization. The key ideas developed in this thesis are (i) Vision-Language supervision for action understanding, and (ii) Leveraging language for video summarization.</p><p>In Vision-Language supervision for action understanding, we generate rich action descriptions and leverage information from multiple modalities to recognize and anticipate future actions in videos. We also investigate the extent to which language contributes to understanding actions in videos through effective cross-modal supervision between the vision and language modalities.</p><p>Finally, in Leveraging language for video summarization, we generate text outputs for every input modality and evaluate the performance of foundation models on the video summarization task. By using text as the primary mode of input, we evaluate how text representations perform on video summarization. Building on this, we propose a hierarchical framework that incorporates multi-granular language cues and evaluate its effectiveness for video summarization.</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2025-07-28T14:00:00-04:00]]></value>
      <value2><![CDATA[2025-07-28T16:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
  </field_extras>
  <field_audience>
    <item>
      <value><![CDATA[Public]]></value>
    </item>
  </field_audience>
  <field_media>
  </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Room C1215 CODA (Midtown)]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
      <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
    <item>
      <url>https://gatech.zoom.us/j/3287180871?omn=93053535981</url>
      <link_title><![CDATA[Zoom link]]></link_title>
    </item>
  </links_related>
  <files>
  </files>
  <og_groups>
    <item>434381</item>
  </og_groups>
  <og_groups_both>
    <item><![CDATA[ECE Ph.D. Dissertation Defenses]]></item>
  </og_groups_both>
  <field_categories>
    <item>
      <tid>1788</tid>
      <value><![CDATA[Other/Miscellaneous]]></value>
    </item>
  </field_categories>
  <field_keywords>
    <item>
      <tid>100811</tid>
      <value><![CDATA[PhD Defense]]></value>
    </item>
    <item>
      <tid>1808</tid>
      <value><![CDATA[graduate students]]></value>
    </item>
  </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
