PhD Proposal by Matthew Whitlock
Title: Designing and Automating Asynchronous, Localized, Multi-Level Fault-Tolerance at the Application Level
Date: Monday, Nov 6, 2023
Time: 4:15 pm - 6:15 pm ET
Location: Klaus 3402 and Virtual
teams.microsoft.com
Matthew Whitlock
Ph.D. Student
School of Computer Science
Georgia Institute of Technology
Committee:
Dr. Vivek Sarkar (Advisor) - School of Computer Science, Georgia Institute of Technology
Dr. Keita Teranishi - Programming Systems, Oak Ridge National Laboratory
Dr. Ada Gavrilovska - School of Computer Science, Georgia Institute of Technology
Dr. Tom Conte - School of Computer Science, Georgia Institute of Technology
Dr. Umakishore Ramachandran - School of Computer Science, Georgia Institute of Technology
Abstract:
Though hardware reliability improvements have extended the lifespan of traditional,
inefficient application resilience based on global Checkpoint/Recovery (C/R), it is increasingly
apparent that this burden has a cost. Year over year, chips implement more complex
functions and components to handle the reliability impacts of meeting performance, power,
and density demands of next-gen computing. Software-based resilience is no longer just a
necessary burden for long-running applications – it is a key component of hardware/software
codesign that opens the door for improvements in component performance, efficiency,
and cost. However, application-level resilience must display certain key properties before
these benefits can be realized at large scales. The traditional global teardown-restart response
to failures compounds the costs of faults and quickly reaches a scalability cliff;
resilience designs must localize the cost of fault tolerance with online process recovery,
asynchronous checkpointing, and preservation of progress on processes distant from those
lost. Meeting these standards requires complex, multi-layered resilience designs – of the
type developers are reluctant to implement. We extend existing state-of-the-art resilience
tools and design new approaches for simplifying and enhancing the most difficult aspects of
contemporary fault tolerance. With them, we implement and evaluate algorithms capable of
maintaining performance even as fault rates exceed checkpoint rates. Through integrations
with modern programming models and composable layers of resilience, we demonstrate
highly effective avenues for relieving the burden of implementing optimal application- and
platform-tailored resilience in complex, asynchronous, and dynamic programs.
Status
- Workflow Status: Published
- Created By: Tatianna Richardson
- Created: 11/01/2023
- Modified By: Tatianna Richardson
- Modified: 11/01/2023