{"628115":{"#nid":"628115","#data":{"type":"event","title":"PhD Proposal by Chao Chen","body":[{"value":"\u003Cp\u003ETitle: Lightweight Resiliency Mechanism via Compiler Techniques\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EChao Chen\u003C\/p\u003E\r\n\r\n\u003Cp\u003EPh.D. Student in Computer Science\u003C\/p\u003E\r\n\r\n\u003Cp\u003ESchool of Computer Science\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003ECollege of Computing\u003C\/p\u003E\r\n\r\n\u003Cp\u003EGeorgia Institute of Technology\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDate: Monday, November 4, 2019\u003C\/p\u003E\r\n\r\n\u003Cp\u003ETime: 10:30 - 12:00 (EST)\u003C\/p\u003E\r\n\r\n\u003Cp\u003ELocation: KACB 3126\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003ECommittee:\u003C\/p\u003E\r\n\r\n\u003Cp\u003E------------\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDr. Santosh Pande (Advisor, School of Computer Science, Georgia Institute of Technology)\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDr. Greg Eisenhauer (Advisor,\u0026nbsp;School of Computer Science, Georgia Institute of Technology)\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDr. Ling Liu (School of Computer Science, Georgia Institute of Technology)\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDr. Vivek Sarkar (School of Computer Science,\u0026nbsp; Georgia Institute of Technology)\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EAbstract:\u003C\/p\u003E\r\n\r\n\u003Cp\u003E-----------\u003C\/p\u003E\r\n\r\n\u003Cp\u003ETransient\u0026nbsp; faults\u0026nbsp; are\u0026nbsp; a\u0026nbsp; significant\u0026nbsp; concern\u0026nbsp; for\u0026nbsp; emerging\u0026nbsp; extreme-scale\u0026nbsp; high\u0026nbsp; performance computing (HPC) systems. 
\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EThis nascent problem is exacerbated by technology\u0026nbsp; trends\u0026nbsp; toward\u0026nbsp; smaller\u0026nbsp; transistor\u0026nbsp; size,\u0026nbsp; higher\u0026nbsp; circuit\u0026nbsp; density\u0026nbsp; and \u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Ethe\u0026nbsp; use\u0026nbsp; of near-threshold voltage techniques to save power.\u0026nbsp; While transient faults in memories\u0026nbsp; can\u0026nbsp; be\u0026nbsp; managed\u0026nbsp; with \u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eparity\u0026nbsp; techniques,\u0026nbsp; faults\u0026nbsp; in\u0026nbsp; processing\u0026nbsp; components\u0026nbsp; are not so easily detectable\u0026nbsp;and manageable.\u0026nbsp; These faults can\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Ecause major problems for HPC applications.\u0026nbsp; Faults in different CPU components manifest differently and are best\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eapproached in different ways.\u0026nbsp; Faults manifested in floating-point units are highly likely to corrupt applications\u0026rsquo; state\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Ewithout any warnings and lead to incorrect outputs (called Silent Data Corruptions or SDCs), and faults in the integer\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Ecomputations are more likely to cause control problems and\/or manifest themselves as addressing faults which cause\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eapplication termination (named Soft Failures or SFs), because integer instructions tend to dominate control and address\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Ecalculations in HPC applications.\u0026nbsp; While SDCs harm the confidence in computations and could lead to inaccurate scientific\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Einsights, SFs degrade system efficiency and performance; SFs require the impacted jobs to be restarted from\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Etheir checkpoints and 
to recompute lost computations before continuing normal operation.\u0026nbsp; To address these\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Echallenges, this thesis proposes a set of lightweight techniques to mitigate the impact of transient faults by both\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eexploiting application properties for SDC detection, and by leveraging compiler techniques for recovery.\u0026nbsp; This work\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Emakes the following contributions:\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EFirst, this thesis proposes LADR, a low-cost application-level SDC detector for scientific applications. LADR protects\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Escientific applications from SDCs by watching for\u0026nbsp; data\u0026nbsp; anomalies in their state\u0026nbsp; variables.\u0026nbsp; It\u0026nbsp; employs\u0026nbsp; compile-time \u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Edata-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Ewhile maintaining a high level of fault coverage with low false positive rates.\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003ESecond, this thesis proposes CARE, a lightweight compiler-assisted technique for on-the-fly repair of processes crashed\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eby transient faults in the address path.\u0026nbsp; The goal of CARE is to allow repaired processes\u0026nbsp;to simply continue their execution\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Einstead of being terminated and restarted. During the compilation of applications, CARE constructs a recovery kernel for each\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eload\/store. 
It traps segmentation faults caused by the use of corrupted addresses, extracts appropriate state from the suspended\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eprocess and uses the recovery kernels to attempt to recreate a correct version of the address, so that it\u0026nbsp; can\u0026nbsp; retry\u0026nbsp; the\u0026nbsp; faulted \u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eload\/store\u0026nbsp; and\u0026nbsp; continue\u0026nbsp; the\u0026nbsp; application.\u0026nbsp; CARE,\u0026nbsp; leveraging compile-time preparation and using segmentation faults as\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Ea detection mechanism, ensures that there is no run-time overhead under non-faulty execution and spends\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eminimal time in recovery under a runtime fault.\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EFinally, despite the promising results achieved by CARE, recovery remains very challenging for important runtime\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eartifacts such as\u0026nbsp;induction variable updates, which cause a significant portion of failures in many other scientific workloads.\u003C\/p\u003E\r\n\r\n\u003Cp\u003ETo address this challenge, we looked into the code optimization techniques in modern compilers, and found that some of these\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Etechniques, such as strength-reduction, can open up opportunities by turning\u0026nbsp; array\u0026nbsp; accesses\u0026nbsp; into\u0026nbsp; strength-reduced\u0026nbsp; pointers \u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Ewhich are\u0026nbsp; updated\u0026nbsp; independently\u0026nbsp; in\u0026nbsp; lockstep. 
\u0026nbsp; Modified induction-variable-based strength-reduction allows\u0026nbsp;independent\u0026nbsp; but\u0026nbsp;\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eequivalent\u0026nbsp; computations (patterns) so that\u0026nbsp;a correct value for the corrupted pointer can be inferred from the value of\u0026nbsp;another.\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EThus, smarter recovery kernels are designed to\u0026nbsp; recover from\u0026nbsp; a broader range of\u0026nbsp; soft\u0026nbsp; failures by exploiting\u0026nbsp;\u0026ldquo;accidental\u0026rdquo;\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003Eredundancy introduced by code optimization techniques with no impact on code speed.\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n","summary":null,"format":"limited_html"}],"field_subtitle":"","field_summary":"","field_summary_sentence":[{"value":"Lightweight Resiliency Mechanism via Compiler Techniques"}],"uid":"27707","created_gmt":"2019-10-28 13:19:52","changed_gmt":"2019-10-28 13:19:52","author":"Tatianna Richardson","boilerplate_text":"","field_publication":"","field_article_url":"","field_event_time":{"event_time_start":"2019-11-04T10:30:00-05:00","event_time_end":"2019-11-04T12:00:00-05:00","event_time_end_last":"2019-11-04T12:00:00-05:00","gmt_time_start":"2019-11-04 15:30:00","gmt_time_end":"2019-11-04 17:00:00","gmt_time_end_last":"2019-11-04 17:00:00","rrule":null,"timezone":"America\/New_York"},"extras":[],"groups":[{"id":"221981","name":"Graduate Studies"}],"categories":[],"keywords":[{"id":"102851","name":"Phd proposal"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[{"id":"1788","name":"Other\/Miscellaneous"}],"invited_audience":[{"id":"78761","name":"Faculty\/Staff"},{"id":"78771","name":"Public"},{"id":"174045","name":"Graduate students"},{"id":"78751","name":"Undergraduate 
students"}],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[],"email":[],"slides":[],"orientation":[],"userdata":""}}}