event

PhD Proposal by Jitesh Jain

Title: Toward Multimodal Intelligence: Perception, Memory & Any-Horizon Reasoning

 

Jitesh Jain 

Ph.D. Student in Computer Science

School of Interactive Computing 

Georgia Institute of Technology 

https://praeclarumjj3.github.io/

 

Date: May 22, 12:00 - 2:00 PM ET

Location: Coda 1215

Zoom: https://gatech.zoom.us/j/9814414092?pwd=WnpkTjNhRHhYQlNzZGxTTW9SWmtJdz09

 

Committee:

Dr. Humphrey Shi (Advisor) - School of Interactive Computing, Georgia Institute of Technology

Dr. Zsolt Kira - School of Interactive Computing, Georgia Institute of Technology

Dr. Kartik Goyal - School of Interactive Computing, Georgia Institute of Technology

Dr. Judy Hoffman - Donald Bren School of Information and Computer Sciences, University of California, Irvine

Dr. Jianwei Yang - Member of Technical Staff, xAI

 

Abstract: Multimodal large language models have made impressive strides in language understanding and reasoning, yet they struggle with abilities that come naturally to humans: perceiving objects in cluttered scenes, remembering context across long interactions, and reasoning adaptively over extended time horizons. In this thesis, we argue that closing this gap requires integrating three capabilities that remain weak in current systems: visual perception, multimodal memory, and any-horizon reasoning.

 

First, we show that vision-language models fail at basic object-level perception and that incorporating structured segmentation and depth signals as visual inputs significantly improves performance. Second, we improve spatial reasoning more fundamentally by distilling expert visual knowledge into the model's internal representations during pre-training, with no added cost at inference. Third, we build a multimodal agent with a graph-structured cognitive memory that enables efficient retrieval of multimodal context across long conversations. Finally, we propose an adaptive agent system for reasoning over long videos, addressing the challenges of scalable data collection, system design, and training recipes for open-ended video understanding.

Status

  • Workflow status: Published
  • Created by: Tatianna Richardson
  • Created: 05/08/2026
  • Modified by: Tatianna Richardson
  • Modified: 05/08/2026