
PhD Defense by Yanbo Xu


Title: Robust Representation Learning and Real-Time Serving in Deep Models for Health Time Series

 

Date: Friday, March 31, 2023

Time: 11:00 – 13:00 (EST)

Location: CODA C1315 Grant Park

Virtual Location: https://gatech.zoom.us/j/7495954088

 

 

Committee

1. Dr. Chao Zhang (Advisor, School of Computational Science and Engineering, College of Computing)
2. Dr. Alexey Tumanov (Co-Advisor, School of Computer Science, College of Computing)
3. Dr. Yao Xie (H. Milton Stewart School of Industrial and Systems Engineering)
4. Dr. Aditya Prakash (School of Computational Science and Engineering, College of Computing)
5. Dr. Kevin Maher, MD (Emory University School of Medicine and Children's Healthcare of Atlanta)

 

Student Name: Yanbo Xu

Machine Learning PhD Student

School of Computational Science and Engineering
Georgia Institute of Technology

 

Abstract

Modern Electronic Health Record (EHR) systems provide large amounts of data that enable machine learning (ML) researchers to develop methods for improving healthcare. However, development in a clinical setting presents unique challenges in ML model training and serving. For example, EHR data are usually captured from multiple sources over time in noisy environments such as Intensive Care Units (ICUs). As a result, the data take the form of time series with multiple issues, including heterogeneity, missingness, and irregularity. Although ML methods such as deep neural networks have been successfully developed for many predictive health tasks, improvements are still needed to learn robust and efficient predictive models that can harness such multi-modal, noisy, and massive time series data.

 

In this dissertation, we aim to tackle the following fundamental problems in developing ML models for health time series:

  • Multiple modalities in time series. Clinical time series are often generated on different devices at different frequencies. A typical ICU monitoring dataset can contain continuous signals like electrocardiograms (ECG), evenly charted tabular data like vital signs, and sparse discrete events like lab tests and medications. Simple binning of values can discard rich information in dense data and mask important information in sparse data. To address this, we design an efficient ensembling algorithm that reweights models individualized for each data modality. Then, to better capture the underlying heterogeneity of the multimodal data, we further design individualized embeddings per modality and fit a self-attention Transformer on top of them to fuse the EHR time series more robustly.
  • Missing observations at random time steps. Data collection is often noisy in EHR systems. Missing or mis-timestamped data occur due to random device disconnections, patients’ body movements, human errors, etc. Models that do not account for missingness and noise in their inputs can overfit and produce biased predictions. We incorporate stochastic differential equations into spatial-temporal modeling, enabling imputation of randomly missing fields in structural time series with support for uncertainty quantification. We further propose score-based diffusion models for generating missing data and denoising the observed discrete event sequences.
  • Large unlabelled data available across different sites. True labels are expensive to obtain in clinical applications. Although input signals can be easily collected in EHR systems, many labels of interest still require manual annotation and retrospective data review by clinical experts. Thus, large amounts of unlabelled data, which can be collected across several different hospitals, become available to researchers, whereas only a small fraction is labelled. To address this challenge, we investigate self-supervised learning in deep models and learn robust representations from the large unlabelled data that can later be adapted and fine-tuned for downstream tasks.
  • Timely serving in resource-limited systems. In clinical environments such as ICUs, care practitioners need to make appropriate decisions in a timely manner. Thus far, deep learning models in healthcare have mainly been developed to increase prediction accuracy, but few of them consider whether or how they can be served in real time in a resource-constrained deployment environment. To bridge the gap, we design cost-aware prediction pipelines that can cascade to differently sized models to balance prediction accuracy against serving cost.
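
As a rough illustration of the cascading idea in the last item, the sketch below runs models from cheapest to most expensive and stops as soon as one is confident enough. It is a generic confidence-threshold cascade, not the dissertation's pipeline: the Stage class, the cascade_predict function, the margin parameter, the cost values, and the two toy models are all hypothetical names and numbers chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Stage:
    """One model in the cascade (hypothetical structure for illustration)."""
    name: str
    cost: float  # assumed per-call serving cost, in arbitrary units
    predict: Callable[[Sequence[float]], float]  # returns P(positive class)


def cascade_predict(
    stages: List[Stage], x: Sequence[float], margin: float
) -> Tuple[float, str, float]:
    """Try stages in order; return (probability, stage name, total cost).

    A stage's answer is accepted when its probability is at least
    `margin` away from 0.5. The final stage always answers.
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    total_cost = 0.0
    for stage in stages[:-1]:
        total_cost += stage.cost
        p = stage.predict(x)
        if abs(p - 0.5) >= margin:  # confident enough, stop early
            return p, stage.name, total_cost
    last = stages[-1]
    total_cost += last.cost
    return last.predict(x), last.name, total_cost


# Toy stand-ins for a small and a large model (assumptions, not real models).
def small_model(x: Sequence[float]) -> float:
    return min(1.0, max(0.0, sum(x) / len(x)))


def large_model(x: Sequence[float]) -> float:
    return 0.9 if max(x) > 0.8 else 0.1


stages = [Stage("small", 1.0, small_model), Stage("large", 10.0, large_model)]

easy = cascade_predict(stages, [0.95, 0.9, 1.0], margin=0.3)
hard = cascade_predict(stages, [0.4, 0.5, 0.9], margin=0.3)
print(easy[1], easy[2])  # stage that answered and the cost paid
print(hard[1], hard[2])
```

In this toy setup, inputs the small model is sure about are served at the small model's cost alone, while ambiguous inputs pay for both models. Raising margin sends more inputs to the large model, which is one simple way to trade serving cost for accuracy.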

Status

  • Workflow Status: Published
  • Created By: Tatianna Richardson
  • Created: 03/24/2023
  • Modified By: Tatianna Richardson
  • Modified: 03/24/2023
