PhD Proposal by Jitesh Jain

Title: Toward Multimodal Intelligence: Perception, Memory & Any-Horizon Reasoning

Jitesh Jain

Ph.D. Student in Computer Science

School of Interactive Computing

Georgia Institute of Technology

https://praeclarumjj3.github.io/

Date: Friday, May 22, 2026, 12:00 PM - 2:00 PM EDT

Location: Coda 1215

Zoom: https://gatech.zoom.us/j/9814414092?pwd=WnpkTjNhRHhYQlNzZGxTTW9SWmtJdz09

Committee:

Dr. Humphrey Shi (Advisor) - School of Interactive Computing, Georgia Institute of Technology

Dr. Zsolt Kira - School of Interactive Computing, Georgia Institute of Technology

Dr. Kartik Goyal - School of Interactive Computing, Georgia Institute of Technology

Dr. Judy Hoffman - Donald Bren School of Information and Computer Sciences, University of California, Irvine

Dr. Jianwei Yang - Member of Technical Staff, xAI

Abstract: Multimodal large language models have made impressive strides in language understanding and reasoning, yet they struggle with abilities that come naturally to humans: perceiving objects in cluttered scenes, remembering context across long interactions, and reasoning adaptively over extended time horizons. In this thesis, we argue that closing this gap requires integrating three capabilities that remain weak in current systems: visual perception, multimodal memory, and any-horizon reasoning.

First, we show that vision-language models fail at basic object-level perception and that incorporating structured segmentation and depth signals as visual inputs significantly improves performance. Second, we improve spatial reasoning more fundamentally by distilling expert visual knowledge into the model's internal representations during pre-training, with no added cost at inference. Third, we build a multimodal agent with a graph-structured cognitive memory that enables efficient retrieval of multimodal context across long conversations. Finally, we propose an adaptive agent system for reasoning over long videos, addressing the challenges of scalable data collection, system design, and training recipes for open-ended video understanding.

Details

Friday, May 22, 2026, 12:00 PM - 2:00 PM

Location: Coda 1215

Groups

Graduate Studies

Status

Workflow status: Published
Created by: Tatianna Richardson
Created: 05/08/2026
Modified By: Tatianna Richardson
Modified: 05/08/2026
