Dear Faculty Members and Fellow Students,

You are cordially invited to attend my thesis defense.

Thesis Title: Non Parametric Statistical Modeling using Wavelets: Theory and Methods

Advisor:

Dr. Brani Vidakovic

Committee Members:

Dr. Yao Xie

Dr. Dave Goldsman

Dr. Justin Romberg

Dr. Kamran Paynabar

Date and Time: Friday, January 18, 2019, 1:00 PM

Location: ISyE Main 126

Abstract:

This dissertation aims to contribute to the existing theory and methodology in the field of Data Science, with a focus on wavelet-based non-parametric statistics, chosen for its robustness to prior modeling assumptions and its flexibility across many application contexts.

With this objective in mind, four wavelet-based methodologies are introduced and analyzed. Applications such as survival density estimation in the presence of randomly censored data, non-linear additive regression, and multiscale correlation analysis are covered, and each topic is studied from both theoretical and practical perspectives. A theoretical foundation for each proposed method is developed, and applications are then illustrated using simulation studies and real data sets.

My thesis is organized into six chapters, covering the following topics:

In Chapter 1, the motivation for the use of wavelets is provided, and general definitions and results involving their use in statistics are introduced. This chapter establishes the minimal theoretical foundation on which the methodologies introduced in the subsequent chapters are built.

In Chapter 2, the density estimation problem is studied. A non-parametric estimator for probability densities in the presence of randomly censored data is introduced, and its statistical properties are analyzed. A linear density estimator using wavelet coefficients that are fully data-driven is proposed. This estimator is shown to be asymptotically unbiased, with global mean square consistency. In addition, its performance is evaluated across a range of example distributions, sample sizes, and censoring schemes. Finally, implementation recommendations and remarks are provided as guidance for practitioners interested in applying the proposed technique.

In Chapter 3, the problem of non-parametric regression for additive models is investigated, introducing a novel approach using orthogonal projections onto linear functional subspaces. These regression models are useful in the analysis of responses determined by non-linear relationships with multivariate predictors, providing more flexibility and generality than the traditional multi-dimensional linear regression model. A mean-square consistent estimator based on an orthogonal projection onto a multiresolution space using empirical wavelet coefficients is proposed, and its convergence rates are analyzed when the set of unknown functions can be characterized by a known smoothness index. These results are obtained without assuming an equispaced design, a condition that is typically required in wavelet-based procedures. In addition, a qualitative comparison with existing methodologies is provided, illustrating the potential estimation capabilities of the proposed methodology.

In Chapter 4, the additive regression problem is analyzed from a different viewpoint: the classical least squares solution using an orthogonal wavelet basis is proposed, and its theoretical properties are analyzed when the design matrix can be assumed to satisfy certain dimensionality conditions. This estimation methodology is based on periodic orthogonal wavelets on the interval [0,1]. A strongly consistent estimator (with respect to the L2 norm) is introduced, leading to optimal convergence rates up to a logarithmic factor, independent of the dimensionality of the problem. As in the previous chapter, these results are obtained without assuming an equispaced design for the predictors, which shows the flexibility of wavelets for statistical applications and the power of the least squares approach. These theoretical results are complemented with a simulation study and an application of the proposed method to a real-life data set, enabling its comparison with several machine learning algorithms in a real-life scenario previously published in the literature.

In Chapter 5, an alternative approach for the additive regression problem using Bayesian hierarchical Normal-Inverse-Gamma (NIG) structures is introduced. First, a robust and simple model that reduces to an l2-regularized regression model is proposed and implemented. The theoretical derivations of the estimator and predictive distribution are provided, and the hyper-parameter selection is discussed. Furthermore, an implementation algorithm based on a backfitting approach is proposed and its performance is studied via simulation. Second, this model is extended to a mixture of NIG distributions on the expansion coefficients, improving the capacity of the model to adapt to different degrees of smoothness in the unknown functions. Closed-form solutions for the Bayes estimator are derived and their structure is discussed. Next, a special case of the previous model is analyzed: a point-mass contaminated NIG model. This modeling structure aims to enforce sparser estimation of the functions in the model, thus providing a more adaptive methodology for irregular functions. Finally, the applicability of these methods is illustrated via a simulation study, and their performance is compared to the least squares approach and to a method called AMLET, introduced by Sardy and Tseng (2004).
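
For readers less familiar with the link between NIG priors and l2 regularization mentioned for Chapter 5, the standard conjugate result can be sketched as follows. The notation (design matrix X, coefficient vector \beta, prior precision \lambda, hyper-parameters a and b) is assumed here for illustration only and may differ from the parameterization used in the thesis.

```latex
% Assumed generic NIG regression model (not necessarily the thesis's exact setup):
\begin{align*}
  y \mid \beta, \sigma^2 &\sim \mathcal{N}\!\left(X\beta,\ \sigma^2 I_n\right), \\
  \beta \mid \sigma^2 &\sim \mathcal{N}\!\left(0,\ \sigma^2 \lambda^{-1} I_p\right), \\
  \sigma^2 &\sim \mathcal{IG}(a, b).
\end{align*}
% The conditional posterior of \beta given \sigma^2 is Gaussian, and its mean
% does not depend on \sigma^2, so the posterior mean is the ridge solution:
\begin{align*}
  \mathbb{E}\left[\beta \mid y\right]
    = \left(X^\top X + \lambda I_p\right)^{-1} X^\top y
    = \arg\min_{\beta}\ \left\{ \lVert y - X\beta \rVert_2^2
      + \lambda \lVert \beta \rVert_2^2 \right\}.
\end{align*}
```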

Finally, in Chapter 6, the problem of correlation analysis is studied from a multiscale perspective via the application of the discrete wavelet transform (DWT). A systematic methodology exploits the linearity and orthogonality of the DWT to decompose a sample correlation into a weighted sum of scale-wise correlations that have a special additive structure and enable the extraction of information about possible linear relationships that would otherwise remain hidden. In addition, some of the theoretical properties of the expansion coefficients are analyzed for stationary processes, and a robust statistical test based on a condition number is introduced, comparing its performance with popular parametric and non-parametric tests when signals are generated from AR(1), MA(1) or ARMA(1,1) processes. Lastly, a simple application use case is provided, showing its usability in the context of Data Analytics for correlated data.
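
As a rough illustration of the decomposition idea in Chapter 6, the sketch below uses an orthonormal Haar DWT to split a sample correlation into a weighted sum of scale-wise correlations. It is a minimal example under stated assumptions, not the methodology of the thesis: the Haar wavelet, the power-of-two signal length, and the function names (haar_details, scalewise_correlation) are all chosen here for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def haar_details(x):
    """Orthonormal Haar DWT of x (length assumed to be a power of two).

    Returns (details, approx): a list of detail-coefficient lists,
    finest scale first, and the single remaining approximation coefficient.
    """
    approx = list(x)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, approx[0]


def scalewise_correlation(x, y):
    """Return one (weight, correlation) pair per scale.

    For centered signals the approximation coefficient is zero, and the
    orthonormal transform preserves inner products, so the sum of
    weight * correlation over scales equals the sample correlation.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    dx, _ = haar_details(xc)
    dy, _ = haar_details(yc)
    total_norm = math.sqrt(dot(xc, xc) * dot(yc, yc))
    terms = []
    for u, v in zip(dx, dy):
        scale_norm = math.sqrt(dot(u, u) * dot(v, v))
        weight = scale_norm / total_norm
        rho = dot(u, v) / scale_norm if scale_norm > 0 else 0.0
        terms.append((weight, rho))
    return terms


# Hypothetical toy signals, used only to exercise the functions.
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0, 7.0, 9.0]
terms = scalewise_correlation(x, y)
reconstructed = sum(w * r for w, r in terms)
```

Here reconstructed should match the ordinary sample correlation of x and y up to floating-point error, while the individual pairs in terms show how much each scale contributes.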