PhD Defense by Chaofan Huang

Title: Novel Experimental Design Techniques for Data Science

 

Date: April 8th, 2024

Time: 9-10:30 am ET

Location: Groseclose 402, or join online via Microsoft Teams (Meeting ID: 296 750 335 855, Passcode: zPBQyQ)

 

Committee

Dr. Roshan V. Joseph (Advisor), H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

Dr. C. F. Jeff Wu, H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

Dr. Matthew Realff, School of Chemical and Biomolecular Engineering, Georgia Institute of Technology

Dr. Enlu Zhou, H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

Dr. Shihao Yang, H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

 

Abstract

 

Experimental design is a fundamental area of statistics, with many fascinating techniques developed in recent years. Many of these methodologies have applications well beyond experimental design problems, but those applications remain largely unexplored. This thesis presents novel applications of experimental design techniques to data science problems, with a focus on sampling, optimization, and machine learning.

 

Chapter 1 and Chapter 2 integrate principles from optimal space-filling design into weighted resampling methods, an integral part of sequential sampling, survey sampling, and related procedures. The most straightforward approach is to draw samples independently with replacement according to their weights. However, this may result in clustered resamples that provide duplicated information and fail to cover certain regions of the sample space. Chapter 1 introduces a novel deterministic weighted sampling scheme known as Importance Support Points (ISP) resampling. ISP resampling selects optimal resamples that not only best represent the weighted samples in terms of energy distance but are also space-filling. We incorporate ISP resampling into sequential sampling methods and demonstrate its empirical improvement over existing weighted resampling techniques. However, the quadratic complexity of ISP computation restricts its practicality in large data settings. To address this shortcoming, Chapter 2 presents Weighted Twinning, a nearest-neighbor-based heuristic algorithm that computes ISP orders of magnitude faster, making it applicable to a broader class of problems.
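
As a rough illustration of the baseline described above (not the ISP or Weighted Twinning algorithms themselves), the following sketch draws a resample independently with replacement according to the weights, counts how many distinct points survive, and evaluates the energy distance between the resample and the weighted sample. The function name energy_distance, the synthetic Gaussian points, and the weights are all assumptions made for illustration; they do not come from the thesis.

```python
import numpy as np


def energy_distance(points, weights, resample):
    """Energy distance between the weighted sample (points, weights)
    and an equally weighted resample. Weights must sum to one."""
    def pairwise(a, b):
        # Euclidean distance between every row of a and every row of b
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    m = len(resample)
    cross = (pairwise(resample, points) @ weights).mean()
    within_resample = pairwise(resample, resample).sum() / m**2
    within_target = weights @ pairwise(points, points) @ weights
    return 2.0 * cross - within_resample - within_target


rng = np.random.default_rng(0)
n, m = 500, 50

# Hypothetical weighted sample: standard normal points with
# importance-style weights that favor the region around (1, 1).
points = rng.normal(size=(n, 2))
log_w = -0.5 * np.sum((points - 1.0) ** 2, axis=1)
weights = np.exp(log_w - log_w.max())
weights /= weights.sum()

# Baseline: independent draws with replacement according to the weights.
idx = rng.choice(n, size=m, replace=True, p=weights)
resample = points[idx]

n_unique = len(np.unique(idx))
ed = energy_distance(points, weights, resample)
print(f"distinct points in resample: {n_unique} of {m}")
print(f"energy distance to weighted sample: {ed:.4f}")
```

A deterministic scheme in the spirit of the abstract would instead choose the resample to make this energy distance as small as possible, which is the quantity the sketch only evaluates.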

 

Chapter 3 explores another deterministic resampling method based on the minimum energy design (MinED). MinED resampling also aims to find a set of space-filling resamples that are representative of the target distribution. The effectiveness of MinED resampling is illustrated by its integration with sequential sampling for constructing space-filling designs in highly constrained regions. Extensive simulation results demonstrate improved performance over existing state-of-the-art techniques.

 

Chapter 4 presents a novel sequential design technique for efficiently calibrating (optimizing) the parameters of a functional output model. The proposed algorithm improves on standard Bayesian optimization by (i) using the generalized chi-square distribution as a more appropriate predictive distribution for the squared distance objective function in calibration problems, and (ii) applying functional principal component analysis to reduce the dimensionality of the functional response data, which allows for efficient approximation of the predictive distribution and subsequent computation of the expected improvement acquisition function.

 

Finally, Chapter 5 applies variance-based global sensitivity analysis to factor importance computation, one of the fundamental problems in statistics and machine learning. Many existing works focus on model-based importance, but a feature that is important in one learning algorithm may hold little significance in another. Hence, a factor importance measure ought to characterize a feature's predictive potential without relying on a specific prediction algorithm. To bypass the modeling step, an equivalence between predictive potential and total Sobol' indices is drawn, and a consistent estimator that uses only the noisy data is proposed. Integrating this estimator with forward selection and backward elimination gives rise to a novel algorithm for factor importance ranking and selection. The effectiveness of the algorithm is demonstrated in simulations and real-world examples.

 

Status

  • Workflow Status: Published
  • Created By: Tatianna Richardson
  • Created: 03/18/2024
  • Modified By: Tatianna Richardson
  • Modified: 03/18/2024
