PhD Defense by Nikolai Warner

Title: Improving Out-of-Distribution Generalization in Human-Centric Multimodal Vision

Date: Monday, June 22, 2026

Time: 1:00 - 3:00 PM ET

Location: Coda C0915 Atlantic + Remote (https://teams.microsoft.com/meet/250438047509225?p=kzLrPnM2Ap8Ny0Dq9t)

Meeting ID: 250 438 047 509 225 | Passcode: 2Tn3Us6F

Nikolai Warner

Robotics Ph.D. Candidate

George W. Woodruff School of Mechanical Engineering

Georgia Institute of Technology

Committee

Dr. Irfan Essa (Advisor) - School of Interactive Computing, Georgia Institute of Technology

Dr. Thomas Ploetz - School of Interactive Computing, Georgia Institute of Technology

Dr. Zsolt Kira - School of Interactive Computing, Georgia Institute of Technology

Dr. Judy Hoffman - School of Interactive Computing, Georgia Institute of Technology

Dr. Apaar Sadhwani - Amazon

Abstract

Despite steady in-distribution progress on human-centric vision tasks and the emergence of powerful foundation models, in-the-wild and out-of-distribution performance still lags. This dissertation studies four such tasks (interactive segmentation, non-rigid image editing, 3D human pose estimation, and motion-language alignment) and traces their out-of-distribution gap to two distinct failures: a signal-side failure, where the input modality is ill-posed for the task, and a noise-side failure, where the supervision channel carries distribution-specific nuisance. On the signal side, DAISeg enriches click-conditioned segmentation with an open-vocabulary saliency channel (from +3 mIoU on seen classes up to +10.5 on unseen, beating SAM under text-conditioned clicks), and AugLift hands the 2D-to-3D lifter a per-joint depth lower bound (−8.9% OOD MPJPE across four architectures, plus cross-dataset SOTA when combined with DG techniques). On the noise side, IPC-Edit constructs supervision that had no public equivalent, filtering and composing three noisy proxies into a 13.5K-pair corpus for identity-preserving non-rigid editing (68.5% identity preservation vs. 61%), while MoCHA denoises supervision that already exists, distilling an LLM canonicalization operator that strips annotator style from captions and setting a new cross-distribution SOTA (T2M R@1 from 13.74 to 26.59, +94%).
