<node id="690203">
  <nid>690203</nid>
  <type>event</type>
  <uid>
    <user id="27707"><![CDATA[27707]]></user>
  </uid>
  <created>1778261228</created>
  <changed>1778261256</changed>
  <title><![CDATA[PhD Proposal by Jitesh Jain]]></title>
  <body><![CDATA[<p><strong>Title:</strong>&nbsp;Toward Multimodal Intelligence: Perception, Memory &amp; Any-Horizon Reasoning</p><p>&nbsp;</p><p>Jitesh Jain</p><p>Ph.D. Student in Computer Science</p><p>School of Interactive Computing</p><p>Georgia Institute of Technology</p><p><a href="https://praeclarumjj3.github.io/">https://praeclarumjj3.github.io/</a></p><p>&nbsp;</p><p><strong>Date:</strong>&nbsp;May 22, 2026, 12:00 - 2:00 PM EDT</p><p><strong>Location:</strong>&nbsp;Coda 1215</p><p>Zoom: <a href="https://gatech.zoom.us/j/9814414092?pwd=WnpkTjNhRHhYQlNzZGxTTW9SWmtJdz09">https://gatech.zoom.us/j/9814414092?pwd=WnpkTjNhRHhYQlNzZGxTTW9SWmtJdz09</a></p><p>&nbsp;</p><p><strong>Committee:</strong></p><p>Dr. Humphrey Shi (Advisor) - School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Zsolt Kira - School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Kartik Goyal - School of Interactive Computing, Georgia Institute of Technology</p><p>Dr. Judy Hoffman - Donald Bren School of Information and Computer Sciences, University of California, Irvine</p><p>Dr. Jianwei Yang - Member of Technical Staff, xAI</p><p>&nbsp;</p><p><strong>Abstract:&nbsp;</strong>Multimodal large language models have made impressive strides in language understanding and reasoning, yet they struggle with abilities that come naturally to humans: perceiving objects in cluttered scenes, remembering context across long interactions, and reasoning adaptively over extended time horizons. In this thesis, we argue that closing this gap requires integrating three capabilities that remain weak in current systems: visual perception, multimodal memory, and any-horizon reasoning.</p><p>&nbsp;</p><p>We begin by identifying that vision-language models fail at basic object-level perception, and we show that incorporating structured segmentation and depth signals as visual inputs significantly improves performance. Second, we improve spatial reasoning more fundamentally by distilling expert visual knowledge into the model's internal representations during pre-training, adding no cost at inference. Third, we build a multimodal agent with a graph-structured cognitive memory that enables efficient retrieval of multimodal context across long conversations. Finally, we propose an adaptive agent system for reasoning over long videos, addressing the challenges of scalable data collection, system design, and training recipes for open-ended video understanding.</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Toward Multimodal Intelligence: Perception, Memory & Any-Horizon Reasoning]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>Toward Multimodal Intelligence: Perception, Memory &amp; Any-Horizon Reasoning</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2026-05-22T12:00:00-04:00]]></value>
      <value2><![CDATA[2026-05-22T14:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Coda 1215]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>221981</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[Graduate Studies]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>102851</tid>
        <value><![CDATA[Phd proposal]]></value>
      </item>
      </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
