
PhD Defense by Tianyi Liu


Dear faculty members and fellow students,

 

You are cordially invited to attend my thesis defense.

  

Thesis Title: Theoretical Analysis of Stochastic Gradient Descent in Stochastic Optimization

 

Advisors:

Dr. Enlu Zhou, School of Industrial and Systems Engineering, Georgia Tech

Dr. Tuo Zhao, School of Industrial and Systems Engineering, Georgia Tech

 

Committee members:

Dr. Alexander Shapiro, School of Industrial and Systems Engineering, Georgia Tech

Dr. Robert D. Foley, School of Industrial and Systems Engineering, Georgia Tech

Dr. Zhengyuan Zhou, Stern School of Business, New York University

 

Date and Time: 10:00 am (EDT), Friday, April 9th, 2021

  

Meeting URL: https://bluejeans.com/287756905

Meeting ID: 287 756 905 (BlueJeans)

  

Abstract:

Stochastic Gradient Descent (SGD) type algorithms have been widely applied to stochastic optimization problems, such as those arising in machine learning. Despite their empirical success, there is still a lack of theoretical understanding of the convergence properties of SGD and its variants. The major bottleneck comes from the highly nonconvex optimization landscape and the complicated noise structure. This thesis aims to provide useful insights into the strong performance of SGD type algorithms through theoretical analysis, with the help of diffusion approximation and martingale theory. Specifically, we answer the following questions:
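
As background, one standard (simplified) form of the diffusion approximation; the exact scaling used in the thesis may differ. For a small step size \eta, the discrete SGD recursion is approximated by a stochastic differential equation (SDE):

    \theta_{k+1} = \theta_k - \eta\,\nabla f(\theta_k;\xi_k)
    \quad\longrightarrow\quad
    d\Theta_t = -\nabla F(\Theta_t)\,dt + \sqrt{\eta}\,\Sigma(\Theta_t)^{1/2}\,dW_t,

where F(\theta) = \mathbb{E}_\xi[f(\theta;\xi)] is the population objective, \Sigma(\theta) is the covariance of the gradient noise, and W_t is a standard Brownian motion. Properties of the SDE (for example, escape times from saddle points) then translate into predictions about the discrete algorithm.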

 

Chapter 1: What is the effect of momentum in nonconvex optimization? We propose to analyze the algorithmic behavior of Momentum Stochastic Gradient Descent (MSGD) by diffusion approximation for general nonconvex optimization problems. Our study shows that momentum helps the iterates escape from saddle points, but hurts convergence within the neighborhood of optima (in the absence of step size annealing or momentum annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
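
As a rough illustration only (not the exact algorithm or annealing schedule analyzed in the thesis), a minimal Python sketch of the MSGD update with an optional momentum-annealing schedule; grad, the step size, and the linear annealing rule are all illustrative assumptions:

    import numpy as np

    def msgd(grad, theta0, lr=0.01, momentum=0.9, n_iters=1000, anneal=False, seed=0):
        # Momentum SGD: v <- mu * v - lr * stochastic gradient; theta <- theta + v.
        # grad(theta, rng) is assumed to return a noisy gradient estimate.
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        v = np.zeros_like(theta)
        for k in range(n_iters):
            # Hypothetical linear annealing schedule for illustration only.
            mu = momentum * (1.0 - k / n_iters) if anneal else momentum
            v = mu * v - lr * grad(theta, rng)
            theta = theta + v
        return theta

    # Demo on a noisy quadratic f(theta) = 0.5 * ||theta||^2:
    noisy_grad = lambda theta, rng: theta + 0.1 * rng.standard_normal(theta.shape)
    print(msgd(noisy_grad, np.ones(5), anneal=True))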

 

Chapter 2: How does the noise in SGD help the algorithm avoid spurious local optima? We answer this question through a simple two-layer convolutional neural network model, which has one spurious local optimum and one global optimum. Our theory shows that perturbed gradient descent and perturbed mini-batch stochastic gradient algorithms, in conjunction with noise annealing, are guaranteed to converge to a global optimum in polynomial time from an arbitrary initialization. This implies that the noise enables the algorithm to efficiently escape from the spurious local optimum.
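
A minimal sketch of the perturbed-gradient-plus-noise-annealing idea, under illustrative assumptions (a generic loss and a geometric decay schedule; the algorithm and schedule in the thesis may differ):

    import numpy as np

    def perturbed_gd(grad, theta0, lr=0.05, sigma0=1.0, decay=0.999, n_iters=2000, seed=0):
        # Each step takes a (full or mini-batch) gradient step and injects
        # Gaussian noise whose scale sigma is annealed toward zero, so the
        # iterate can escape the spurious local optimum early on and settle
        # near a global optimum once the noise has decayed.
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        sigma = sigma0
        for _ in range(n_iters):
            theta = theta - lr * grad(theta) + sigma * rng.standard_normal(theta.shape)
            sigma *= decay  # noise annealing
        return theta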

 

Chapter 3: How does the noise in SGD help select optima with good generalization performance? We further investigate the role of noise when multiple global optima exist by considering the nonconvex rectangular matrix factorization problem, which has infinitely many global minima due to rotation and scaling invariance. Gradient descent (GD) can converge to any of these optima, depending on the initialization. In contrast, we show that a perturbed form of GD with an arbitrary initialization converges to a global optimum that is uniquely determined by the injected noise. Our result implies that the noise imposes an implicit bias toward certain optima.
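
For concreteness, a toy Python sketch of perturbed GD on the factorization objective f(U, V) = 0.5 * ||M - U V^T||_F^2; the rank, step size, and fixed noise scale are illustrative assumptions, not the exact setup of the thesis:

    import numpy as np

    def perturbed_gd_factorize(M, r, lr=0.01, sigma=1e-3, n_iters=5000, seed=0):
        # Perturbed GD on f(U, V) = 0.5 * ||M - U @ V.T||_F**2.
        # Small isotropic Gaussian noise is injected into every step; which of
        # the rotation/scaling-equivalent global optima is reached is then
        # determined by the injected noise rather than by the initialization.
        rng = np.random.default_rng(seed)
        m, n = M.shape
        U = 0.01 * rng.standard_normal((m, r))
        V = 0.01 * rng.standard_normal((n, r))
        for _ in range(n_iters):
            R = U @ V.T - M                    # residual
            gU, gV = R @ V, R.T @ U            # gradients w.r.t. U and V
            U = U - lr * gU + sigma * rng.standard_normal(U.shape)
            V = V - lr * gV + sigma * rng.standard_normal(V.shape)
        return U, V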

 

Chapter 4: Does reusing past samples in SGD improve efficiency in simulation optimization? We consider a special type of stochastic optimization problem, simulation optimization, whose main challenge is the limited simulation budget imposed by the high computational cost of simulation experiments. One approach to overcoming this challenge is to reuse simulation outputs from previous iterations in the current iteration of the optimization procedure. However, due to the dependence among iterations, simulation replications from different iterations are not independent, and the good empirical performance of this approach has lacked theoretical justification. We fill this gap by theoretically studying the stochastic gradient descent method with reuse of past simulation replications. We show that reusing past replications does not change the convergence of the algorithm, which implies that the bias of the gradient estimator is asymptotically negligible. Moreover, we show that reusing past replications reduces the variance of the gradient estimators around local optima, which implies that the algorithm can achieve faster convergence.
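
A minimal sketch of the reuse idea under illustrative assumptions (a sliding window that averages recent replications uniformly; the estimator in the thesis may weight or select past outputs differently):

    import numpy as np
    from collections import deque

    def sgd_with_reuse(sim_grad, theta0, lr=0.01, reuse=10, n_iters=1000, seed=0):
        # Each iteration draws one fresh simulation replication of the gradient
        # at the current iterate and averages it with up to `reuse` replications
        # generated at earlier iterates. Near a local optimum the iterates barely
        # move, so the reused replications are nearly unbiased and the averaging
        # mainly reduces the variance of the gradient estimator.
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        buffer = deque(maxlen=reuse)
        for _ in range(n_iters):
            buffer.append(sim_grad(theta, rng))   # one new replication per iteration
            theta = theta - lr * sum(buffer) / len(buffer)
        return theta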
