PhD Proposal by Chao Chen

Title: Lightweight Resiliency Mechanism via Compiler Techniques

 

Chao Chen

Ph.D. Student in Computer Science

School of Computer Science 

College of Computing

Georgia Institute of Technology

 

Date: Monday, November 4, 2019

Time: 10:30 - 12:00 (EST)

Location: KACB 3126

 

 

Committee:

------------

Dr. Santosh Pande (Advisor, School of Computer Science, Georgia Institute of Technology)

Dr. Greg Eisenhauer (Advisor, School of Computer Science, Georgia Institute of Technology)

Dr. Ling Liu (School of Computer Science, Georgia Institute of Technology)

Dr. Vivek Sarkar (School of Computer Science, Georgia Institute of Technology)

 

Abstract:

-----------

Transient faults are a significant concern for emerging extreme-scale high performance computing (HPC) systems. The problem is exacerbated by technology trends toward smaller transistor sizes, higher circuit density, and the use of near-threshold voltage techniques to save power. While transient faults in memories can be managed with parity techniques, faults in processing components are not so easily detected and managed, and they can cause major problems for HPC applications. Faults in different CPU components manifest differently and are best approached in different ways. Faults in floating-point units are highly likely to corrupt an application's state without any warning and lead to incorrect outputs (called Silent Data Corruptions, or SDCs), whereas faults in integer computations are more likely to cause control problems or manifest as addressing faults that terminate the application (called Soft Failures, or SFs), because integer instructions tend to dominate control and address calculations in HPC applications. While SDCs undermine confidence in computations and can lead to inaccurate scientific insights, SFs degrade system efficiency and performance: impacted jobs must be restarted from their checkpoints and must recompute lost work before continuing normal operation. To address these challenges, this thesis proposes a set of lightweight techniques that mitigate the impact of transient faults by exploiting application properties for SDC detection and by leveraging compiler techniques for recovery. This work makes the following contributions:

 

First, this thesis proposes LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables. It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false-positive rates.
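
As a rough illustration of the idea, the sketch below flags a suspected SDC when a monitored state variable departs sharply from what its recent history predicts. It is not LADR's actual detector; the linear-extrapolation prediction, the tolerance parameter, and the function name suspect_sdc are illustrative assumptions.

/* A minimal sketch of anomaly-based SDC detection on one monitored state
 * variable, in the spirit of LADR but not its implementation.  The linear
 * extrapolation and the tolerance are illustrative assumptions. */
#include <math.h>
#include <stdbool.h>

/* Returns true if the new value of a monitored variable deviates sharply
 * from what its two previous time steps predict. */
bool suspect_sdc(double prev2, double prev1, double current, double tol)
{
    double predicted = 2.0 * prev1 - prev2;       /* linear extrapolation */
    double scale = fmax(fabs(predicted), 1e-12);  /* guard divide-by-zero */
    return fabs(current - predicted) / scale > tol;
}

A checker of this kind would be invoked each time step, but only on the small set of state variables selected by the compile-time data-flow analysis, which is what keeps the runtime and memory overheads low.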

 

Second, this thesis proposes CARE, a lightweight compiler-assisted technique for on-the-fly repair of processes crashed by transient faults in the address path. The goal of CARE is to let repaired processes simply continue their execution instead of being terminated and restarted. During compilation, CARE constructs a recovery kernel for each load/store. At runtime, it traps segmentation faults caused by the use of corrupted addresses, extracts the appropriate state from the suspended process, and uses the recovery kernels to recreate a correct version of the address, so that the faulted load/store can be retried and the application can continue. By leveraging compile-time preparation and using segmentation faults as its detection mechanism, CARE incurs no runtime overhead during non-faulty execution and spends minimal time in recovery when a fault occurs.
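
The following sketch illustrates the trap-and-retry idea in a simplified form: a segmentation fault raised by a corrupted address is caught, the address is rebuilt from state that is still intact, and the access is retried. It is not CARE's implementation; CARE prepares its recovery kernels at compile time and retries the exact faulted load/store, whereas this sketch uses a setjmp-based retry, and all names and the simulated bit flip are illustrative assumptions.

/* A simplified sketch of trapping a segmentation fault caused by a
 * corrupted address and retrying the access with a repaired address. */
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static sigjmp_buf retry_point;

static void segv_handler(int sig)
{
    (void)sig;
    siglongjmp(retry_point, 1);         /* return to the retry point */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    size_t n = 1024, i = 512;
    double *base = calloc(n, sizeof(double));
    if (!base) return 1;
    double * volatile p = base + i;     /* address derived from base and i */
    volatile int first_attempt = 1;

    if (sigsetjmp(retry_point, 1)) {
        /* "Recovery kernel": rebuild the address from state that is intact. */
        p = base + i;
    }
    if (first_attempt) {
        first_attempt = 0;
        /* Simulate a transient bit flip; the result is very likely unmapped. */
        p = (double *)((uintptr_t)p ^ ((uintptr_t)1 << 40));
    }
    printf("loaded %f\n", *p);          /* faults once, then succeeds on retry */
    free(base);
    return 0;
}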

 

Finally, despite the promising results achieved by CARE, recovery remains very challenging for important runtime artifacts such as induction variable updates, which account for a significant portion of failures in many other scientific workloads. To address this challenge, this thesis examines code optimization techniques in modern compilers and finds that some of them, such as strength reduction, open up new opportunities by turning array accesses into strength-reduced pointers that are updated independently but in lockstep. A modified induction-variable-based strength reduction yields independent but equivalent computations (patterns), so that a correct value for a corrupted pointer can be inferred from the value of another. Smarter recovery kernels are therefore designed to recover from a broader range of soft failures by exploiting the “accidental” redundancy introduced by these code optimizations, with no impact on code speed.
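
The sketch below shows the kind of “accidental” redundancy strength reduction creates: two strength-reduced pointers advance in lockstep, so a corrupted one can be re-derived from the other. The loop, the array names, and the recovery expression are illustrative assumptions, not code generated by the thesis's compiler.

/* A minimal sketch of the lockstep-pointer redundancy that strength
 * reduction creates. */
#include <stddef.h>

void axpy(double *a, const double *b, double alpha, size_t n)
{
    /* After strength reduction, the accesses a[i] and b[i] become pointers
     * pa and pb that the loop advances together; the index i may disappear. */
    double *pa = a;
    const double *pb = b;
    for (const double *end = b + n; pb != end; ++pa, ++pb)
        *pa += alpha * *pb;
    /* Lockstep invariant a smarter recovery kernel can exploit:
     *     pa == a + (pb - b)   and, symmetrically,   pb == b + (pa - a).
     * If a fault corrupts pa, it can be recomputed from a, b, and pb. */
}

Because this redundancy comes from code the optimizer emits anyway, exploiting it adds no instructions to the fault-free path, which is why there is no impact on code speed.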

 
