PhD Proposal by Jitesh Jain
Title: Toward Multimodal Intelligence: Perception, Memory & Any-Horizon Reasoning
Jitesh Jain
Ph.D. Student in Computer Science
School of Interactive Computing
Georgia Institute of Technology
https://praeclarumjj3.github.io/
Date: May 22, 12:00 - 2:00 PM ET
Location: Coda 1215
Zoom: https://gatech.zoom.us/j/9814414092?pwd=WnpkTjNhRHhYQlNzZGxTTW9SWmtJdz09
Committee:
Dr. Humphrey Shi (Advisor) - School of Interactive Computing, Georgia Institute of Technology
Dr. Zsolt Kira - School of Interactive Computing, Georgia Institute of Technology
Dr. Kartik Goyal - School of Interactive Computing, Georgia Institute of Technology
Dr. Judy Hoffman - Donald Bren School of Information and Computer Sciences, University of California, Irvine
Dr. Jianwei Yang - Member of Technical Staff, xAI
Abstract: Multimodal large language models have made impressive strides in language understanding and reasoning, yet they struggle with abilities that come naturally to humans: perceiving objects in cluttered scenes, remembering context across long interactions, and reasoning adaptively over extended time horizons. In this thesis, we argue that closing this gap requires integrating three capabilities that remain weak in current systems: visual perception, multimodal memory, and any-horizon reasoning.
First, we show that vision-language models fail at basic object-level perception and that incorporating structured segmentation and depth signals as visual inputs significantly improves performance. Second, we improve spatial reasoning more fundamentally by distilling expert visual knowledge into the model's internal representations during pre-training, with no added cost at inference. Third, we build a multimodal agent with a graph-structured cognitive memory that enables efficient retrieval of multimodal context across long conversations. Finally, we propose an adaptive agent system for reasoning over long videos, addressing the challenges of scalable data collection, system design, and training recipes for open-ended video understanding.
Status
- Workflow status: Published
- Created by: Tatianna Richardson
- Created: 05/08/2026
- Modified By: Tatianna Richardson
- Modified: 05/08/2026