
PhD Defense by Bobin Deng


Title: Scalable Energy-efficient Microarchitectures with Computational Error Tolerance

 

Bobin Deng

Ph.D. Candidate

School of Computer Science

College of Computing

Georgia Institute of Technology

 

Date: April 19th, 2021 (Monday)

Time: 4:00 PM - 6:00 PM (EDT)

Location: *No Physical Location*

BlueJeans: https://bluejeans.com/9026676698

 

Committee:

Dr. Thomas M. Conte (advisor) - School of Computer Science, Georgia Institute of Technology

Dr. Hyesoon Kim - School of Computer Science, Georgia Institute of Technology

Dr. Alexandros Daglis - School of Computer Science, Georgia Institute of Technology

Dr. Arijit Raychowdhury - School of Electrical and Computer Engineering, Georgia Institute of Technology

Dr. Jeanine Cook - Computer Science Research Institute, Sandia National Laboratories 

 

 

Abstract:

Dennard scaling of conventional semiconductor technology has reached its limit, resulting in issues pertaining to leakage current and threshold voltage. Energy savings found at the transistor level by simply lowering supply voltage are no longer available for these devices (e.g., MOSFETs) and have reached the Landauer-Shannon limit. Recent proposals of millivolt switch technologies aim to extend the technology scaling roadmap by maintaining a high on/off ratio of drain current at a much lower supply voltage. However, the high intermittent error probabilities of millivolt switches constrain their Vdd reduction in traditional architectures. Thus, there is an urgent need for scalable and energy-efficient microarchitectures with computational error tolerance. This dissertation systematically leverages the error detection and correction properties of the Redundant Residue Number System (RRNS) by varying the number of non-redundant (n) and redundant (r) components (residues). It selects configuration points from a two-dimensional (n, r)-RRNS design plane that meet given error detection and/or correction capabilities, and discusses the trade-offs among them. Handling resilience efficiently in this (n, r)-RRNS plane significantly improves reliability, allowing further Vdd reduction and energy savings.
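
As an illustration of the (n, r)-RRNS idea described above, the following is a minimal textbook-style sketch in Python, not the dissertation's hardware design. The moduli (3, 5, 7 as non-redundant and 11, 13 as redundant), the function names, and the brute-force correction strategy are all assumptions chosen for readability. A value is stored as its residues under n + r pairwise coprime moduli; decoding with the Chinese Remainder Theorem yields a value outside the legitimate range when a residue has been corrupted, and with r = 2 a single corrupted residue can be located by dropping one residue at a time.

```python
from math import prod

# Hypothetical (n, r) = (3, 2) configuration. Moduli are pairwise coprime and
# every redundant modulus is larger than every non-redundant one.
NON_REDUNDANT = (3, 5, 7)
REDUNDANT = (11, 13)
MODULI = NON_REDUNDANT + REDUNDANT
LEGIT_RANGE = prod(NON_REDUNDANT)  # valid values lie in [0, 105)


def encode(x):
    """Represent x as one residue per modulus (non-redundant, then redundant)."""
    if not 0 <= x < LEGIT_RANGE:
        raise ValueError("value outside the legitimate range")
    return [x % m for m in MODULI]


def decode(residues, moduli=MODULI):
    """Reconstruct the integer from its residues via the Chinese Remainder Theorem."""
    total = prod(moduli)
    value = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        value += r * partial * pow(partial, -1, m)
    return value % total


def has_error(residues):
    """A corrupted word decodes to a value outside the legitimate range."""
    return decode(residues) >= LEGIT_RANGE


def correct_single_error(residues):
    """Brute-force single-error correction: drop each residue in turn and
    accept the decoding that falls back inside the legitimate range."""
    for skip in range(len(MODULI)):
        kept_residues = [r for i, r in enumerate(residues) if i != skip]
        kept_moduli = [m for i, m in enumerate(MODULI) if i != skip]
        candidate = decode(kept_residues, kept_moduli)
        if candidate < LEGIT_RANGE:
            return candidate
    return None  # uncorrectable with this simple scheme


word = encode(42)
corrupted = list(word)
corrupted[1] = (corrupted[1] + 2) % MODULI[1]  # inject a fault in one residue
print(has_error(word), has_error(corrupted), correct_single_error(corrupted))
```

In this sketch, increasing r raises the number of residue errors that can be detected or corrected at the cost of extra residue channels, which is the kind of trade-off the (n, r) design plane captures.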

 

First, the necessary implementation details of RRNS cores are discussed. Second, scalable RRNS microarchitectures are proposed that simultaneously support error correction and checkpointing with restart for uncorrectable errors. Third, novel RRNS-based adaptive checkpoint-and-restart mechanisms are designed that automatically guarantee reliability while minimizing the energy-delay product (EDP). Finally, the RRNS design space is explored to find the optimal (n, r) configuration points. At similar reliability, compared to a conventional binary core (running at high Vdd) without computational error tolerance, the proposed scalable RRNS microarchitecture reduces EDP by 53% on average for memory-intensive workloads and by 67% on average for non-memory-intensive workloads.

 

This dissertation's second topic is alleviating the fault-rate and power-consumption issues of exascale computing. Faults in High-Performance Computing (HPC) have become an urgent challenge, with the Mean Time Between Failures (MTBF) of exascale systems projected to be only several minutes under contemporary methodologies. Unfortunately, existing error-tolerance technologies for HPC systems have serious deficiencies, such as insufficient error-tolerance coverage, high power consumption, and difficult integration with existing workloads. Considering Department of Energy (DOE) guidelines that limit exascale power consumption to 20 MW, this dissertation highlights the issue of energy usage and proposes a thread-level fault-tolerance mechanism that is compatible with current state-of-the-art exascale programming models while meeting the requirements of full-system error protection. Additionally, an efficient microarchitecture and corresponding mechanisms that can support thread-level RRNS are discussed. Experimental results show that this strategy reduces energy consumption by 62.25% and the energy-delay product by 58.67% on average when compared with state-of-the-art black-box resilience techniques.

 

Status

  • Workflow Status: Published
  • Created By: Tatianna Richardson
  • Created: 04/06/2021
  • Modified By: Tatianna Richardson
  • Modified: 04/06/2021
