PhD Proposal by Matthew Whitlock

Title: Designing and Automating Asynchronous, Localized, Multi-Level Fault-Tolerance at the Application Level

 

Date: Monday, Nov 6, 2023

Time: 4:15 pm - 6:15 pm ET

Location: Klaus 3402 and Virtual

Join conversation

teams.microsoft.com

 

 

Matthew Whitlock

Ph.D. Student

School of Computer Science

Georgia Institute of Technology

 

Committee:

Dr. Vivek Sarkar (Advisor) - School of Computer Science, Georgia Institute of Technology

Dr. Keita Teranishi - Programming Systems, Oak Ridge National Laboratory

Dr. Ada Gavrilovska - School of Computer Science, Georgia Institute of Technology

Dr. Tom Conte - School of Computer Science, Georgia Institute of Technology

Dr. Umakishore Ramachandran - School of Computer Science, Georgia Institute of Technology

 

Abstract:

Though hardware reliability improvements have extended the lifespan of traditional, inefficient application resilience based on global Checkpoint/Recovery (C/R), it is increasingly apparent that this burden has a cost. Year over year, chips implement more complex functions and components to handle the reliability impacts of meeting the performance, power, and density demands of next-gen computing. Software-based resilience is no longer just a necessary burden for long-running applications – it is a key component of hardware/software codesign that opens the door for improvements in component performance, efficiency, and cost. However, application-level resilience must display certain key properties before these benefits can be realized at large scales. The traditional global teardown-restart response to failures compounds the costs of faults and quickly reaches a scalability cliff; resilience designs must localize the cost of fault tolerance with online process recovery, asynchronous checkpointing, and preservation of progress on processes distant from those lost. Meeting these standards requires complex, multi-layered resilience designs – of the type developers are reluctant to implement. We extend existing state-of-the-art resilience tools and design new approaches for simplifying and enhancing the most difficult aspects of contemporary fault tolerance. With them, we implement and evaluate algorithms capable of maintaining performance even as fault rates exceed checkpoint rates. Through integrations with modern programming models and composable layers of resilience, we demonstrate highly effective avenues for relieving the burden of implementing optimal application- and platform-tailored resilience in complex, asynchronous, and dynamic programs.

Status

  • Workflow Status: Published
  • Created By: Tatianna Richardson
  • Created: 11/01/2023
  • Modified By: Tatianna Richardson
  • Modified: 11/01/2023
