Thesis Title: Some Computationally Efficient Methods in Statistics and Their Applications in Parameter Estimation and Hypotheses Testing

Advisor: Dr. Xiaoming Huo

Committee members:

Dr. Jeff Wu

Dr. Yajun Mei

Dr. Yao Xie

Dr. Vladimir Koltchinskii (School of Mathematics)

Date and Time: **Friday, 5/5/2017, 11:30 am**

Location: **Groseclose 402 Advisory Boardroom**

Abstract

Parameter estimation and hypothesis testing are fundamental problems in statistics. Many existing methods were developed for problems with a moderate amount of data. Unfortunately, some of those methods can be computationally costly, or even infeasible, when the volume of data is large. This thesis is an attempt to address the need for computationally efficient methods in statistics.

The first part of this thesis focuses on distributed statistical inference, which has recently attracted enormous attention. While much existing work focuses on the averaging estimator, we propose a one-step approach to enhance a simple-averaging-based distributed estimator. We derive the asymptotic properties of the newly proposed estimator and find that it enjoys the same asymptotic properties as the centralized estimator. The one-step approach requires only one additional round of communication relative to the averaging estimator, so the extra communication burden is insignificant. In finite-sample settings, numerical examples show that the proposed estimator outperforms the simple averaging estimator by a large margin in terms of mean squared error. A potential application of the one-step approach is to use multiple machines to speed up large-scale statistical inference with little compromise in the quality of the estimator. The proposed method becomes even more valuable when data are only available on distributed machines with limited communication bandwidth.
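To convey the flavor of the one-step idea, here is a minimal sketch for a distributed exponential-rate MLE: each machine reports its local estimate, the estimates are averaged, and one aggregated Newton step refines the average. The model, sample sizes, and variable names are illustrative assumptions, not the thesis's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 10, 200                                   # machines, samples per machine
data = rng.exponential(scale=1.0, size=(k, m))   # true rate = 1

# Each machine computes its local MLE of the rate (1 / local sample mean).
local_mles = 1.0 / data.mean(axis=1)
avg_est = local_mles.mean()                      # simple-averaging estimator

# One-step refinement: aggregate the gradient and Hessian of the full
# log-likelihood l(lam) = n*log(lam) - lam*sum(x) at the averaged estimate.
n, S = k * m, data.sum()
grad = n / avg_est - S
hess = -n / avg_est**2
one_step = avg_est - grad / hess                 # one Newton-Raphson step

centralized_mle = n / S                          # MLE on the pooled data
```

One Newton iteration from a root-$n$-consistent starting point already matches the centralized MLE up to higher-order terms, which is the intuition behind the one-step construction; only the aggregated gradient and Hessian cross the network, hence the single extra communication round.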

The second part is a statistically and computationally efficient test of independence based on distance covariance and random projections. Tests of independence play a fundamental role in many statistical techniques. Among the nonparametric approaches, distance-based methods have numerous advantages compared with many alternatives. A known limitation of distance-based methods is their computational cost: when the sample size is $n$, a distance-based method, which typically requires computing all pairwise distances, has computational complexity of order $O(n^2)$. Recent advances have shown that in the univariate case, a fast algorithm with $O(n \log n)$ computational complexity and $O(n)$ memory requirement exists. In this part, we show the potential of random projections for converting multivariate problems into multiple univariate ones. As an immediate consequence, we develop a novel test of independence based on random projections and distance covariance, named Randomly Projected Distance Covariance (RPDC). It achieves nearly the same power as the state-of-the-art distance-based approach, works in multivariate cases, and enjoys $O(n K \log n)$ computational complexity and $O(\max\{n,K\})$ memory requirement, where $K$ is the number of random projections. Empirical results suggest that a fixed number of random projections suffices. The theoretical analysis takes advantage of techniques on random projections that are rooted in contemporary machine learning. Numerical experiments demonstrate the efficiency of the proposed method relative to several competitors.
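A minimal sketch of the projection idea follows. For readability it uses the naive $O(n^2)$ univariate distance covariance rather than the fast $O(n \log n)$ algorithm, and the function names and the averaging-over-projections wrapper are illustrative assumptions, not the thesis's exact estimator.

```python
import numpy as np

def dcov_sq_1d(x, y):
    """Sample (squared) distance covariance for univariate x, y.

    Naive O(n^2) version via double-centered pairwise-distance matrices;
    a fast algorithm brings this down to O(n log n)."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

def rpdc(X, Y, K=20, seed=0):
    """Average univariate dCov^2 over K random one-dimensional projections."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    _, q = Y.shape
    stats = []
    for _ in range(K):
        u = rng.standard_normal(p); u /= np.linalg.norm(u)
        v = rng.standard_normal(q); v /= np.linalg.norm(v)
        stats.append(dcov_sq_1d(X @ u, Y @ v))
    return float(np.mean(stats))
```

Each projection reduces the multivariate pair $(X, Y)$ to a univariate pair, where the fast algorithm applies; averaging over $K$ projections yields the stated $O(nK\log n)$ cost.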

In the third part, we apply random projections to energy statistics to develop a fast algorithm for computing energy statistics and derive an efficient two-sample test. A common disadvantage of existing distribution-free two-sample tests is their computational cost: if the sample size is $N$, the computational complexity of those tests is at least $O(N^2)$. In this part, we develop an efficient algorithm with complexity $O(N \log N)$ for computing energy statistics in the univariate case. For multivariate cases, we introduce a two-sample test based on energy statistics and random projections, which enjoys $O(K N \log N)$ computational complexity, where $K$ is the number of random projections. We name our method for \emph{multivariate} cases Randomly Projected Energy Statistics (RPES). We show, both theoretically and empirically, that RPES achieves nearly the same test power as energy statistics. Numerical experiments also demonstrate the efficiency of the proposed method against its competitors.
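The univariate speed-up rests on a standard observation: after sorting, the sum of all pairwise absolute differences has a closed form, so each term of the energy statistic costs $O(N \log N)$. A minimal sketch follows; the variable names and the multivariate wrapper are illustrative assumptions rather than the thesis's exact algorithm.

```python
import numpy as np

def pair_dist_sum(z):
    """Sum of |z_i - z_j| over unordered pairs, in O(N log N) via sorting.

    In the sorted sample, z[i] enters with a '+' sign i times and a '-'
    sign (N - 1 - i) times, giving sum_i z[i] * (2i - N + 1)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    return float(np.sum(z * (2 * np.arange(n) - n + 1)))

def energy_stat_1d(x, y):
    """Sample energy statistic 2*E|X-Y| - E|X-X'| - E|Y-Y'| for univariate data."""
    n, m = len(x), len(y)
    within_x, within_y = pair_dist_sum(x), pair_dist_sum(y)
    # Pooled pairs = within-x pairs + within-y pairs + cross pairs.
    cross = pair_dist_sum(np.concatenate([x, y])) - within_x - within_y
    return 2 * cross / (n * m) - 2 * within_x / n**2 - 2 * within_y / m**2

def rpes(X, Y, K=20, seed=0):
    """Average the fast univariate statistic over K random projections."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    stats = []
    for _ in range(K):
        u = rng.standard_normal(p); u /= np.linalg.norm(u)
        stats.append(energy_stat_1d(X @ u, Y @ u))
    return float(np.mean(stats))
```

Since both samples live in the same space, each projection uses a single direction $u$ for $X$ and $Y$; a large averaged statistic indicates the two projected distributions differ.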