PhD Defense by Fiona Ryan
Title: Towards Human-Centric Perception: Grounding Human Behavior in Multimodal Context
Date: Tuesday, April 7, 2026
Time: 3:00-5:00 PM ET
Location: Coda 0915 & Zoom (https://gatech.zoom.us/j/95248425147)
Fiona Ryan
Ph.D. Student
School of Interactive Computing
Georgia Institute of Technology
Committee
Dr. Judy Hoffman (Advisor) - School of Interactive Computing, Georgia Institute of Technology
Dr. James Rehg (Advisor) - School of Interactive Computing, Georgia Institute of Technology
Dr. James Hays - School of Interactive Computing, Georgia Institute of Technology
Dr. Zsolt Kira - School of Interactive Computing, Georgia Institute of Technology
Dr. Josef Sivic - Czech Institute of Informatics, Robotics, and Cybernetics, Czech Technical University in Prague
Abstract
Perceiving and understanding human behavior with computer vision is a core challenge for developing AI systems that can effectively interact with and assist people in everyday life. Modeling human behavior is challenging because it requires not only visually recognizing behaviors like gaze, gesture, and movement, but also grounding them in the context in which they occur. Human behavior is shaped by intent and higher-level goals, the surrounding physical environment, social interactions with other people, and additional modalities such as speech and language, making it inherently multimodal and situated.
This thesis explores how to model human behavior in context by addressing three core needs: (1) datasets that capture naturalistic human interactions in everyday environments, enabling new behavior modeling tasks, (2) multimodal methods that ground behavior by leveraging information across modalities including vision, audio, and language, and (3) robust methods for recognizing behavioral cues that leverage advances in foundation models to encode context. First, I present contributions to large-scale multimodal egocentric datasets that capture social interactions and human-object interactions during activities. Second, I present a modeling approach and dataset for the novel task of identifying targets of selective auditory attention during social conversations in noisy environments. Third, I present a method for efficiently adapting vision-language retrieval models to represent new concepts and recognize them in different contexts. Fourth, I propose a framework for estimating gaze targets in scenes using representations from a visual foundation model. Finally, I extend this framework to forecasting gaze behavior in egocentric video.