PhD Defense by Zaiwei Chen

Title: A Unified Lyapunov Framework for Finite-Sample Analysis of Reinforcement Learning Algorithms

Date: 04/07/2022
Time: 1:00 - 2:30 pm ET
Location: Groseclose 402, or virtually at https://gatech.zoom.us/j/9849731860?pwd=K29BSStGekgvYkxlK1ZRZVp1QUlLdz09 (Meeting ID: 984 973 1860, Passcode: 7n46MA).
Student Name: Zaiwei Chen
Machine Learning PhD Student
School of Industrial & Systems Engineering
Georgia Institute of Technology

Committee
1. Dr. Siva Theja Maguluri (Advisor)
2. Dr. John-Paul Clarke (Co-advisor)
3. Dr. Justin Romberg
4. Dr. Ashwin Pananjady
5. Dr. Benjamin Van Roy

Abstract: In this thesis, we develop a unified Lyapunov approach for establishing finite-sample guarantees of reinforcement learning (RL) algorithms. Since most RL algorithms can be modeled as stochastic approximation (SA) algorithms under Markovian noise, we first provide a Lyapunov framework for analyzing Markovian SA algorithms. The key idea is to construct a novel Lyapunov function (called the generalized Moreau envelope) that captures the dynamics of the corresponding SA algorithm, and to establish a negative drift inequality, which can then be applied repeatedly to derive finite-sample bounds. We use our SA results to design RL algorithms and perform finite-sample analysis. Specifically, for tabular RL, we establish finite-sample bounds for Q-learning, for variants of on-policy TD-learning algorithms such as n-step TD and TD(\lambda), and for off-policy TD-learning algorithms such as Retrace(\lambda), Q^\pi(\lambda), and V-trace. As by-products, we provide theoretical insight into the efficiency of bootstrapping in on-policy TD-learning and demonstrate the bias-variance trade-off in off-policy TD-learning. For RL with linear function approximation, we design convergent variants of Q-learning and TD-learning in the presence of the deadly triad and derive finite-sample guarantees. The TD-learning algorithm is then used in a general policy-based framework (including approximate policy iteration and natural policy gradient) to find an optimal policy of the RL problem with an O(\epsilon^{-2}) sample complexity.
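
To give a sense of how a negative drift inequality leads to finite-sample bounds, the following minimal LaTeX sketch walks through the standard unrolling argument for a generic SA iterate. The symbols x_k, M, \alpha_k, c_1, and c_2 are illustrative placeholders chosen here for exposition, not the exact notation or constants of the thesis.

\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Hedged sketch: a generic SA iterate with fixed point x^*, analyzed through a
% Lyapunov function M(.) (e.g., a smoothed envelope of a suitable norm). All
% symbols below are illustrative placeholders, not the thesis's notation.
Suppose the iterates $\{x_k\}$ of an SA algorithm with step sizes $\{\alpha_k\}$
and fixed point $x^*$ admit a Lyapunov function $M(\cdot)$ satisfying the
one-step negative drift inequality
\begin{equation*}
  \mathbb{E}\bigl[M(x_{k+1}-x^*)\bigr]
  \le (1-c_1\alpha_k)\,\mathbb{E}\bigl[M(x_k-x^*)\bigr] + c_2\alpha_k^2
\end{equation*}
for some constants $c_1,c_2>0$. Unrolling the recursion yields the
finite-sample bound
\begin{equation*}
  \mathbb{E}\bigl[M(x_k-x^*)\bigr]
  \le M(x_0-x^*)\prod_{j=0}^{k-1}(1-c_1\alpha_j)
  + c_2\sum_{i=0}^{k-1}\alpha_i^2\prod_{j=i+1}^{k-1}(1-c_1\alpha_j),
\end{equation*}
which, for a constant step size $\alpha_k\equiv\alpha\in(0,1/c_1)$, decays
geometrically to an $O(\alpha)$ neighborhood of $x^*$.
\end{document}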
