PhD Defense by Bo Xie

Ph.D. Thesis Defense Announcement

Title: Algorithms and Analysis for Non-convex Optimization Problems in Machine Learning

Bo Xie

School of Computational Science and Engineering

Georgia Institute of Technology

Date and Time : Tuesday, April 18, 1:30PM - 3:30PM Eastern Time

Location : KACB Room 2100

Committee :

Dr. Le Song (Advisor), School of Computational Science and Engineering, Georgia Institute of Technology

Dr. Santosh Vempala, School of Computer Science, Georgia Institute of Technology

Dr. Hongyuan Zha, School of Computational Science and Engineering, Georgia Institute of Technology

Dr. Byron Boots, School of Interactive Computing, Georgia Institute of Technology

Dr. Animashree Anandkumar, Department of Electrical Engineering and Computer Science, University of California Irvine

Abstract :

In this thesis, we propose efficient algorithms and provide theoretical analysis through the angle of spectral methods for some important non-convex optimization problems in machine learning. Specifically, we focus on two types of non-convex optimization problems: learning the parameters of latent variable models and learning in deep neural networks.

I. Spectral methods for latent variable estimation

Learning latent variable models is traditionally framed as a non-convex optimization problem through Maximum Likelihood Estimation (MLE). For some specific models such as multi-view model, we can bypass the non-convexity by leveraging the special model structure and convert the problem into spectral decomposition through Methods of Moments (MM) estimator.

In this thesis, we propose a novel algorithm that can flexibly learn a multi-view model in a non-parametric fashion. It estimates the conditional distributions of the latent variable model by decomposing operators in a functional space.

We then demonstrate one application of spectral methods in the task of DNA motif prediction. By modeling the DNA sequence as a HMM and learn the representation with spectral methods, we can efficiently compute the posterior distribution of the latent variables without being trapped in local optima. We then feed these latent features into a classifier and achieve superior performance than that obtained by hand-crafted features.

II. Scalable algorithms to solve nonlinear spectral methods

One obstacle of applying the nonparametric spectral methods to large datasets is that it scales at least quadratically with the number of data points. To overcome the issue, we propose two versions of scalable nonlinear spectral algorithms. One version, called doubly stochastic gradient descent, uses sampling to approximate two expectations in the problem, and it achieves better balance of computation and statistics by adaptively growing the model as more data arrive. Although it is still a non-convex optimization problem, the algorithm is guaranteed to converge at the rate of $O(1/t)$ where $t$ is the number of iterations.

Another version is the distributed kernel principle component analysis (KPCA) algorithm. The proposed algorithm estimates leverage scores to only sample a few representative data points from the whole dataset. To reduce communication overhead, the leverage scores are approximated by sketched data points. By carefully controlling the balance between communication and approximation error, we obtain a distributed algorithm that is nearly optimally communication efficient.

III. Analysis of neural network learning

Learning with neural networks is a difficult non-convex problem while simple gradient-based methods achieve great success in practice. In this part of the thesis, we try to understand the optimization landscape of learning one-hidden-layer networks with Rectified Linear (ReLU) activation functions. By directly analyzing the structure of the gradient, we can show neural networks with diverse weights have no spurious local optima. This partly explains the empirical success of gradient descent since a stationary point leads to a global optimum under diversity conditions on the neural weights.

Inspired by the analysis, we introduce semi-random units, which sit between fully adjustable ReLU units and random features. Semi-random units possess nice theoretical properties despite the non-convex nature of the optimization problem. In particular, networks with such units have no spurious local optima and gradient descent converges to the global optimum. Moreover, semi-random features only use slightly more units to reach comparable performance as ReLU and they use much fewer units compared with random features.

Best,

Media

No media selected

Summary

Details

Tuesday

Apr 18 2017

02:30pm - 04:30pm

In campus calendar: No

Sidebar Content

No sidebar content

Groups

Graduate Studies

Status

Workflow status: Published
Created by: Tatianna Richardson
Created: 04/11/2017
Modified By: Tatianna Richardson
Modified: 04/11/2017

Mercury (Hg)