**Advisors:** Dr. Yajun Mei and Dr. Brani Vidakovic

**Committee members:**

Dr. Xiaoming Huo

Dr. Paul Griffin (School of Industrial Engineering, Purdue University; Adjunct Professor, ISyE, Georgia Tech)

Dr. Ofer Sadan (School of Medicine, Emory University)

**Date and Time:** Monday, May 06, 2019, 1:00 PM

**Location:** Groseclose 226A

**Abstract:**

Data science is playing an increasingly important role in improving public health. Data used in public health studies come in many types, and this variety creates both opportunities and challenges for statisticians seeking to impact public health. This dissertation aims to develop data-driven, efficient statistical and machine learning techniques for several modern real-world applications. We consider four contexts: (i) visual impairment classification based on noisy high-frequency pupillary response behavior data collected from human-computer interaction, (ii) breast cancer diagnosis using plain X-ray image data, (iii) personalized screening for sepsis based on regularly measured longitudinal biomarkers, and (iv) prediction of the overall burden of postoperative complications using laboratory measurements.

In Chapter 1, we study robust estimation of the Hurst exponent from one-dimensional, high-frequency time series data. High-frequency data from various sources often possess hidden patterns that cannot be explained by basic descriptive statistics, traditional statistical models, or global trends. For such complex data, the Hurst exponent is a powerful tool for detecting muted change patterns: it quantifies the long memory, regularity, self-similarity, and scaling in a time series. In this chapter, we propose robust estimators of the Hurst exponent based on non-decimated wavelet transforms and apply them to pupillary response behavior (PRB) data, using the estimated Hurst exponent as a predictor to classify individuals with different degrees of visual impairment. At a high level, all wavelet-based methods for estimating the Hurst exponent exploit the fact that, on the log scale, the magnitudes of the wavelet coefficients depend linearly on the resolution level, with a slope determined by the Hurst exponent. In this study, we propose a general trimean estimator that balances the tradeoff between the median and extreme values, and apply it to the wavelet coefficients before relating them to the Hurst exponent. This lessens the effect of outliers and yields a robust estimate of the Hurst exponent.
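The idea of combining a trimean with a log-scale regression can be illustrated with a minimal Python sketch. It is not the dissertation's estimator: the `(p, alpha)` parameterization of the trimean, the slope convention `-(2H + 1)` (larger level meaning finer scale), and the simulated coefficients standing in for a real non-decimated wavelet transform are all assumptions made for illustration.

```python
import math
import random


def quantile(sorted_values, p):
    """Linearly interpolated sample quantile of an already sorted list."""
    pos = p * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def general_trimean(values, p=0.25, alpha=0.5):
    """Weighted average of the p-quantile, the median, and the (1-p)-quantile.

    With p=0.25 and alpha=0.5 this is Tukey's trimean (Q1 + 2*median + Q3) / 4.
    The (p, alpha) parameterization is an assumption, not taken from the source.
    """
    s = sorted(values)
    tails = quantile(s, p) + quantile(s, 1 - p)
    return (alpha / 2) * tails + (1 - alpha) * quantile(s, 0.5)


def hurst_from_level_summaries(levels, summaries):
    """Regress log2(summary) on level by least squares and map the slope to H.

    Assumes the convention slope = -(2H + 1) for squared coefficients.
    """
    ys = [math.log2(s) for s in summaries]
    n = len(levels)
    mean_j = sum(levels) / n
    mean_y = sum(ys) / n
    sxy = sum((j - mean_j) * (y - mean_y) for j, y in zip(levels, ys))
    sxx = sum((j - mean_j) ** 2 for j in levels)
    return -(sxy / sxx + 1) / 2


def simulate_coefficients(hurst, levels, n_per_level, outlier_rate, rng):
    """Hypothetical stand-in for wavelet coefficients of a self-similar signal."""
    data = {}
    for j in levels:
        sd = 2 ** (-(hurst + 0.5) * j)  # variance decays like 2^{-(2H+1) j}
        coeffs = []
        for _ in range(n_per_level):
            c = rng.gauss(0.0, sd)
            if rng.random() < outlier_rate:
                c = rng.choice((-1.0, 1.0))  # spike of fixed size at every level
            coeffs.append(c)
        data[j] = coeffs
    return data


rng = random.Random(0)
levels = [1, 2, 3, 4, 5, 6]
coeffs = simulate_coefficients(0.3, levels, 2000, 0.01, rng)

robust = [general_trimean([c * c for c in coeffs[j]]) for j in levels]
naive = [sum(c * c for c in coeffs[j]) / len(coeffs[j]) for j in levels]

robust_h = hurst_from_level_summaries(levels, robust)
naive_h = hurst_from_level_summaries(levels, naive)
print("true H: 0.3")
print("trimean-based estimate:", round(robust_h, 3))
print("mean-based estimate:   ", round(naive_h, 3))
```

Because any quantile of the squared coefficients scales with level in the same way as their mean, replacing the mean with a trimean changes only the intercept of the regression, not its slope. In this simulation the fixed-size spikes dominate the mean at fine levels, so the mean-based estimate is expected to drift from the true value while the trimean-based one stays close.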

In Chapter 2, we extend the robust estimation of the Hurst exponent to two-dimensional images and apply the proposed method to mammograms to diagnose breast cancer. Researchers have developed many statistical and machine learning methods for image classification, but most of them are black-box methods. In this chapter, we propose to model mammogram images with fractional Brownian motion (fBm), develop a robust estimator of the Hurst exponent for two-dimensional fBm models based on non-decimated wavelet transforms, and then predict breast cancer using the extracted Hurst exponent. This allows us to use the underlying degree of self-similarity as a discriminatory descriptor for classifying mammograms as benign or malignant. The two-dimensional case is more complicated than the one-dimensional case because the within-level correlation of non-decimated wavelet coefficients is defined in two-dimensional space and violates the independence assumption. Our main idea is to use a symmetric random sampling technique to address this correlation. Unlike hard-to-interpret machine learning methods, our method summarizes the common features of cancerous images and mimics the way physicians make decisions in practice.

Chapter 3 studies personalized screening for sepsis. Sepsis is a life-threatening complication of infection. In 2016, a group of experts proposed a scoring criterion called the quick Sequential Organ Failure Assessment (qSOFA) as a screening criterion for sepsis. Specifically, if at least two of the following three conditions are satisfied, an alarm is raised for the patient and physicians conduct laboratory tests to further assess sepsis: 1) systolic blood pressure <= 100 mm Hg; 2) respiratory rate >= 22 breaths/min; 3) altered mental status (Glasgow Coma Scale (GCS) score less than 15). However, qSOFA does not perform well in practice and has very low sensitivity. Part of the reason is that qSOFA uses constant thresholds for the biomarkers regardless of patients' baseline information. Hence, we aim to improve qSOFA by developing a knowledge-based machine learning method that learns personalized thresholds depending on patients' baseline information. The main idea is to model the personalized thresholds as functions of patients' baseline demographic variables, and then use a boosting-based weighted exponential loss function to learn them for efficient screening of sepsis. Our method yields efficient personalized monitoring and appropriate subject-specific intervention in the early stages of sepsis, and thus a significant reduction in the mortality rate.
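The fixed-threshold qSOFA rule described above can be written as a short Python sketch. The class and function names are hypothetical; only the three conditions and the "at least two of three" alarm rule come from the text. The thresholds are exposed as parameters to show where the constants sit that the chapter proposes to replace with patient-specific values; the learning of those values is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    """Hypothetical container for the three qSOFA inputs."""
    systolic_bp_mmhg: float
    respiratory_rate_per_min: float
    gcs_score: int


def qsofa_score(v, sbp_max=100.0, rr_min=22.0, gcs_normal=15):
    """Count how many of the three qSOFA conditions are met.

    Defaults are the constant thresholds quoted in the text; a personalized
    variant would pass thresholds computed from baseline information.
    """
    conditions = (
        v.systolic_bp_mmhg <= sbp_max,         # low systolic blood pressure
        v.respiratory_rate_per_min >= rr_min,  # elevated respiratory rate
        v.gcs_score < gcs_normal,              # altered mental status
    )
    return sum(conditions)


def qsofa_alarm(v, **thresholds):
    """Raise an alarm when at least two of the three conditions hold."""
    return qsofa_score(v, **thresholds) >= 2


patient = Vitals(systolic_bp_mmhg=95, respiratory_rate_per_min=24, gcs_score=15)
print(qsofa_score(patient), qsofa_alarm(patient))
```

With constant defaults, two patients with the same vitals always receive the same decision, whatever their baseline characteristics; passing different threshold values per patient is one simple way to picture the personalization the chapter describes.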

In Chapter 4, our motivating example is the prediction of the overall burden of postoperative complications based on a real data set of 206 adult patients who stayed in the Clinic for Digestive Surgery, Clinical Center of Serbia in Belgrade, between November 2016 and October 2017. Recently, a group of experts developed a scale called the Comprehensive Complication Index (CCI) to capture the overall burden of complications in the postoperative period. However, the CCI has several disadvantages: 1) it is calculated by a complicated procedure that requires physicians and nurses to record every detail of a patient's hospitalization, and so is not practical for everyday use; 2) it can be calculated only retrospectively, once the hospitalization is finished, so it reflects the results of perioperative treatment but cannot be used as a measure of a patient's current status. In this chapter, we develop a zero-and-one inflated beta regression model to predict CCI values from patients' clinical covariates, and propose to estimate the unknown sparse coefficient vectors by maximizing the penalized log-likelihood function. Our proposed method not only achieves good prediction of the CCI but also selects important clinical covariates leading to postoperative complications. This allows us to simplify the calculation of the CCI and make it prospective.
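For readers unfamiliar with the model class, one common way to write a zero-and-one inflated beta density is sketched below, assuming the CCI is rescaled to y in [0, 1]. The symbols (point masses \pi_0 and \pi_1, mean \mu, precision \phi, coefficients \beta, tuning parameter \lambda) and the \ell_1 form of the penalty are illustrative assumptions; the abstract does not state the parameterization or the penalty used in the dissertation.

```latex
f(y \mid \pi_0, \pi_1, \mu, \phi) =
\begin{cases}
\pi_0, & y = 0, \\[4pt]
\pi_1, & y = 1, \\[4pt]
(1 - \pi_0 - \pi_1)\,
\dfrac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma\big((1-\mu)\phi\big)}\,
y^{\mu\phi - 1}\,(1 - y)^{(1-\mu)\phi - 1}, & 0 < y < 1,
\end{cases}
\qquad
\hat{\beta} = \arg\max_{\beta}
\Big\{ \ell(\beta) - \lambda \lVert \beta \rVert_1 \Big\}
```

Here \ell(\beta) denotes the log-likelihood obtained when \pi_0, \pi_1, and \mu are linked to the clinical covariates through \beta. A sparsity-inducing penalty of this kind sets some coefficients exactly to zero, which is how a penalized fit can both predict the CCI and select covariates.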
