**Title**: Statistical learning with regularizations: theory and applications

**Advisors**: Dr. Xiaoming Huo, Dr. Nicoleta Serban

**Committee Members**:

Dr. Brani Vidakovic (ISyE)

Dr. Joel Sokol (ISyE)

Dr. Vladimir Koltchinskii (School of Mathematics)

**Date and Time**: Friday, May 3rd, 10:00 am

**Location**: ISyE Groseclose 403

**Abstract**:

This thesis contributes to the area of statistical learning with regularization and its applications. Regularization methods have been widely used for sparse estimation and function estimation in many areas, such as signal/image processing, statistics, bioinformatics, and machine learning.

Our study helps (i) unify high-dimensional sparse estimation with non-convex penalties; (ii) prove the asymptotic optimality of high-order Laplacian regularization in function estimation; (iii) improve the performance of the composite fuselage assembly process by using a sparsity-penalized $\ell_\infty$-based linear model; and (iv) identify the census tracts where children have limited access to preventive dental care.

This thesis comprises four main works. In Chapter 1, under the linear regression framework, we study the variable selection problem when the underlying model is assumed to have a small number of nonzero coefficients (i.e., the underlying linear model is sparse). We propose using difference-of-convex (DC) functions to unify the non-convex penalties that appear in the sparse estimation literature. Under the DC framework, we consider directional-stationary (d-stationary) solutions, which are usually not unique.

In this chapter, we show that under some mild conditions, a certain subset of the d-stationary solutions of an optimization problem with a DC objective has desirable statistical properties: namely, asymptotic estimation consistency, asymptotic model selection consistency, and asymptotic efficiency. This work shows that the DC formulation offers a unified framework for existing work involving non-convex penalties. Our work bridges the optimization and statistics communities.
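
As one illustration of the DC idea, consider the capped-$\ell_1$ penalty with threshold $a > 0$. This penalty is chosen here only as an example; the abstract does not state which penalties the chapter covers. It can be written as a difference of two convex functions $g$ and $h$ (names introduced here for illustration):

$$
\min(|t|,\, a) \;=\; \underbrace{|t|}_{g(t)} \;-\; \underbrace{\max(|t| - a,\; 0)}_{h(t)}, \qquad a > 0 .
$$

Both $g$ and $h$ are convex, and the identity can be checked case by case: if $|t| \le a$, then $h(t) = 0$ and the right-hand side is $|t|$; if $|t| > a$, then $h(t) = |t| - a$ and the right-hand side is $a$.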

In Chapter 2, we propose a function estimation method using high-order Laplacian regularization. Graph Laplacian-based regularization has been widely used in learning problems to exploit information about the geometry of the marginal distribution. In this chapter, we consider the high-order Laplacian regularization, whose empirical (i.e., sample) version takes the form ${\bf f}^T {\bf L}^m {\bf f}$, with ${\bf L}$ being the graph Laplacian matrix of the sample data, and we provide its theoretical foundations in the nonparametric setting. We show that nearly all of the good asymptotic properties of existing state-of-the-art approaches are inherited by the Laplacian-based smoother. Specifically, we prove that as the sample size $n$ goes to infinity, the expected mean squared error (MSE) is of order $O(n^{-\frac{2m}{2m+d}})$, which is the {\it optimal convergence rate} in the setting of nonparametric estimation \cite{Stone82}, where $m$ is the order of the Sobolev semi-norm used in the regularization and $d$ is the intrinsic dimension of the domain. In addition, we propose a {\it generalized cross validation} (GCV) approach to choose the penalty parameter $\lambda$, and we establish its {\it asymptotic optimality} guarantee.
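
To make the penalty ${\bf f}^T {\bf L}^m {\bf f}$ concrete, the following is a minimal sketch rather than the thesis's implementation. It assumes a penalized least-squares objective $\frac{1}{n}\|{\bf y} - {\bf f}\|^2 + \lambda\, {\bf f}^T {\bf L}^m {\bf f}$, an unnormalized Laplacian built from a symmetrized $k$-nearest-neighbor graph, and the standard GCV score for linear smoothers. The function names, the graph construction, the synthetic data, and the $\lambda$ grid are all hypothetical choices made for illustration.

```python
import numpy as np


def knn_graph_laplacian(x, k=6):
    """Unnormalized Laplacian L = D - W of a symmetrized k-NN graph (assumed construction)."""
    n = x.shape[0]
    # Pairwise squared Euclidean distances between sample points.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    w = np.zeros((n, n))
    for i in range(n):
        w[i, np.argsort(d2[i])[:k]] = 1.0
    w = np.maximum(w, w.T)  # symmetrize the adjacency matrix
    return np.diag(w.sum(axis=1)) - w


def laplacian_smoother(y, lap, m, lam):
    """Minimize (1/n)||y - f||^2 + lam * f^T L^m f; returns the fit and the hat matrix."""
    n = len(y)
    penalty = np.linalg.matrix_power(lap, m)
    # Setting the gradient to zero gives (I + n * lam * L^m) f = y.
    hat = np.linalg.inv(np.eye(n) + n * lam * penalty)
    return hat @ y, hat


def gcv_score(y, fitted, hat):
    """Standard GCV score for a linear smoother with hat matrix `hat`."""
    n = len(y)
    return np.mean((y - fitted) ** 2) / (1.0 - np.trace(hat) / n) ** 2


# Synthetic one-dimensional example (illustrative data, not from the thesis).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(80, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.3 * rng.standard_normal(80)

lap = knn_graph_laplacian(x, k=6)
lambdas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
scores = []
for lam in lambdas:
    fitted, hat = laplacian_smoother(y, lap, m=2, lam=lam)
    scores.append(gcv_score(y, fitted, hat))

# Pick the penalty parameter with the smallest GCV score and refit.
best_lam = lambdas[int(np.argmin(scores))]
f_hat, hat_best = laplacian_smoother(y, lap, m=2, lam=best_lam)
```

Forming the dense inverse keeps the sketch short and gives direct access to the trace needed by GCV; for large samples, an eigendecomposition of ${\bf L}$ computed once and reused across all values of $\lambda$ would likely be preferable.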

In Chapter 3, we study the fuselage assembly problem using sparse learning theory. The natural dimensional variability of incoming fuselages affects the assembly speed and the quality of fuselage joins in the composite fuselage assembly process. Thus, shape control is critical to ensuring the quality of composite fuselage assembly. In practice, the maximum gap between the two fuselages plays the key role in assembly. In this work, we consider $\ell_\infty$-based linear regression, which has received little study in statistics but is critical for optimal shape control in fuselage assembly. We mainly study the $\ell_\infty$ model under the framework of high-dimensional sparse estimation, where we use the $\ell_1$ penalty to control the sparsity of the resulting estimator. We derive the estimation error of the $\ell_1$-regularized $\ell_\infty$ linear model, which matches the upper bound in the existing literature. Finally, we use numerical studies of fuselage control to verify the advantages of $\ell_\infty$-based regression.
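
An $\ell_1$-penalized $\ell_\infty$ regression can be solved as a linear program. The sketch below assumes the estimator minimizes $\|{\bf y} - {\bf X}\beta\|_\infty + \lambda \|\beta\|_1$; the exact formulation and scaling used in the thesis may differ. It uses `scipy.optimize.linprog`, and the function name, the data, and the value of $\lambda$ are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def l1_penalized_linf_regression(X, y, lam):
    """Solve min_beta ||y - X beta||_inf + lam * ||beta||_1 as a linear program.

    Decision vector z = [u, v, t] with beta = u - v, u >= 0, v >= 0, and
    t >= 0 an upper bound on the largest absolute residual.
    """
    n, p = X.shape
    # Objective: lam * sum(u + v) + t.
    c = np.concatenate([lam * np.ones(2 * p), [1.0]])
    ones = np.ones((n, 1))
    A_ub = np.vstack([
        np.hstack([X, -X, -ones]),   # X beta - y <= t
        np.hstack([-X, X, -ones]),   # y - X beta <= t
    ])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    beta = res.x[:p] - res.x[p:2 * p]
    return beta, res.x[-1]


# Synthetic sparse example (illustrative data, not fuselage measurements).
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 10))
beta_true = np.zeros(10)
beta_true[[0, 3]] = [2.0, -1.5]
y = X @ beta_true + rng.uniform(-0.1, 0.1, size=40)

lam = 0.05
beta_hat, max_resid = l1_penalized_linf_regression(X, y, lam)
```

Splitting $\beta$ into nonnegative parts $u$ and $v$ is the usual way to linearize the $\ell_1$ norm, and the auxiliary variable $t$ equals the maximum absolute residual at the optimum, which is the quantity that corresponds to the maximum gap in the assembly setting.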

In Chapter 4, we compared access to preventive dental care for low-income children eligible for public dental insurance with that of children with private dental insurance and/or high family income ($>$400\% of the federal poverty level) in Georgia, and we evaluated the impact of policies aimed at increasing access to dental care for low-income children.

Specifically, we used multiple sources of data (e.g., US Census, Georgia Board of Dentistry) to estimate measures of preventive care access in 2015 for children aged 0 to 18 years. Measures included met need, scarcity of dentists, and one-way travel distance to a dentist at the census tract level. We used an optimization model to estimate access, quantify disparities, and evaluate policies. We found that about 1.5 million children were eligible for public insurance, and 600,000 had private insurance and/or high family income. Across census tracts, average met need was 59\% for low-income children and 96\% for high-income children; for rural census tracts, these values were 33\% and 84\%, respectively. The average travel distance across all census tracts was 3.71 miles for high-income/insured children and 17.16 miles for low-income children; for rural census tracts, these values were 11.55 and 32.91 miles, respectively. Met need increased significantly and travel distance decreased with modest increases in provider acceptance of Medicaid-eligible children. To achieve 100\% met need, an 80\% provider participation rate would be required. We conclude that across census tracts, high-income children had notably higher access than low-income children. Identifying the tracts where access is limited could result in more efficient allocation of public health dental resources.