PhD Defense by Joo Hwan Lee

Event Details
  • Date/Time:
    • Thursday, December 1, 2016
      1:00 pm - 3:00 pm
  • Location: KACB 2100

Summary Sentence: Relaxing Coherence for modern learning applications


Ph.D. Defense of Dissertation Announcement


Title: Relaxing Coherence for modern learning applications


Joo Hwan Lee

School of Computer Science

College of Computing

Georgia Institute of Technology


Date: Thursday, Dec 1, 2016

Time: 1 PM to 3 PM EST

Location: KACB 2100




Committee:

Dr. Hyesoon Kim (Advisor, School of Computer Science, Georgia Tech)
Dr. Richard Vuduc (School of Computational Science and Engineering, Georgia Tech)
Dr. Hadi Esmaeilzadeh (School of Computer Science, Georgia Tech)
Dr. Le Song (School of Computational Science and Engineering, Georgia Tech)
Dr. Nuwan Jayasena (AMD Research, Advanced Micro Devices)




The main objective of this research is the efficient execution of learning (model training) for modern machine learning (ML) applications. The recent explosion in data has led to the emergence of data-intensive ML applications whose key phase, learning, requires significant amounts of computation. A unique characteristic of learning is that it is iterative-convergent: a consistent view of memory need not always be guaranteed, so parallel workers may compute with stale values in intermediate computations, relaxing certain read-after-write data dependencies. Although multiple workers read and modify shared model parameters many times during learning, incurring repeated data communication between workers, most of this communication is redundant because of this stale-value tolerance. Relaxing coherence for these learning applications therefore has the potential to provide extraordinary performance and energy benefits, but it requires innovations across the system stack, from hardware to software.
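The iterative-convergent property can be illustrated with a minimal sketch (an illustrative example, not code from the dissertation): a gradient-descent loop that reads a parameter snapshot several updates old still converges, which is why relaxed coherence is tolerable for learning.

```python
# Illustrative sketch: iterative-convergent learning tolerates stale reads.
# A worker computes gradients against a snapshot that may be several
# updates old, yet gradient descent still reaches the optimum.

def grad(w):
    # gradient of f(w) = (w - 3)^2, minimized at w* = 3
    return 2.0 * (w - 3.0)

def stale_sgd(steps=200, lr=0.05, staleness=3):
    w = 0.0
    history = [w]                      # past values of the shared parameter
    for _ in range(steps):
        # read a value up to `staleness` updates old (relaxed coherence)
        stale_w = history[max(0, len(history) - 1 - staleness)]
        w -= lr * grad(stale_w)        # update using the stale gradient
        history.append(w)
    return w

print(stale_sgd())   # converges near the optimum w* = 3
```

With a small enough learning rate, the delayed gradients only slow convergence slightly rather than preventing it, which is the intuition behind tolerating stale reads.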


While considerable effort has been devoted to exploiting stale-value tolerance in distributed learning, the full performance potential of this characteristic remains underutilized, leaving modern ML applications with low execution efficiency on state-of-the-art systems. The inefficiency stems mainly from a lack of architectural support and from a limited understanding of how stale-value tolerance differs across ML applications. Today's architectures, designed to cater to the needs of more traditional workloads, incur high and often unnecessary overhead. The limited understanding has left domain experts without clear guidance, preventing them from realizing the full performance potential of stale-value tolerance. This dissertation presents several innovations addressing this challenge.


First, this dissertation proposes Bounded Staled Sync (BSSync), hardware support for the bounded staleness consistency model that adds simple logic layers to the memory hierarchy to reduce atomic-operation overhead in data-synchronization-intensive workloads. The long latency and serialization caused by atomic operations have a significant impact on performance; the proposed technique overlaps these long-latency atomic operations with the main computation. Whereas previous work allows stale values only for read operations, BSSync also exploits staleness for write operations, permitting stale writes. This reduces the inefficiency caused by moving data between where it is stored and where it is processed.
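The stale-write idea can be sketched in a few lines (assumed semantics for illustration, not the BSSync hardware): a worker buffers its updates locally and merges them into the shared copy only once per staleness bound, instead of issuing an expensive atomic update on every iteration.

```python
# Minimal sketch (assumed semantics, not the BSSync hardware): stale writes
# buffer a worker's updates locally and merge them into the shared copy
# only every `bound` iterations, so the costly atomic-style update is
# issued far less often and can be overlapped with computation.

class StaleWriter:
    def __init__(self, bound):
        self.bound = bound
        self.pending = 0.0   # locally buffered (stale) updates
        self.ticks = 0

    def write(self, shared, delta):
        self.pending += delta
        self.ticks += 1
        if self.ticks >= self.bound:     # staleness bound reached:
            shared["w"] += self.pending  # one merged, atomic-style update
            self.pending, self.ticks = 0.0, 0

shared = {"w": 0.0}
writer = StaleWriter(bound=3)
for _ in range(6):
    writer.write(shared, 1.0)
print(shared["w"])   # 6.0: six writes reached memory as two merges
```

Six logical writes become two physical updates here; the staleness bound caps how long an update may remain buffered before other workers must see it.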


Second, this dissertation presents StaleLearn, a learning acceleration mechanism that reduces the memory-divergence overhead of GPU learning on sparse data. Sparse data induces divergent memory accesses with low locality, so a large fraction of total execution time is spent transferring data across the memory hierarchy. StaleLearn transforms the divergent-memory-access problem into a synchronization problem by replicating the model, and it reduces the synchronization overhead through asynchronous synchronization on Processor-in-Memory (PIM). Stale-value tolerance makes it possible to cleanly decompose tasks between the GPU and PIM, effectively exploiting parallelism between PIM and GPU cores by overlapping PIM operations with the main computation on the GPU cores.
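The replicate-then-reconcile idea can be sketched as follows (an illustrative toy, not the StaleLearn implementation): each worker updates its own replica of a model parameter, and a background merge step, standing in for the PIM-side synchronization, later reconciles the replicas by averaging while the workers keep computing.

```python
# Illustrative sketch (not the StaleLearn implementation): each worker
# updates a private replica of a model parameter, turning divergent
# shared-memory accesses into a synchronization problem; a background
# merge step then reconciles replicas asynchronously by averaging.

def reconcile(replicas):
    # asynchronous synchronization step: pull replicas to their mean
    mean = sum(replicas) / len(replicas)
    return [mean] * len(replicas)

replicas = [1.0, 3.0, 5.0]               # diverged per-worker copies
replicas = [r + 1.0 for r in replicas]   # local (stale) updates per worker
replicas = reconcile(replicas)           # merge, off the GPU's critical path
print(replicas)   # [4.0, 4.0, 4.0]
```

Because learning tolerates the temporary divergence between replicas, the merge can run asynchronously, which is what lets PIM operations overlap with the main GPU computation.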


Finally, this dissertation provides a detailed understanding of how stale-value tolerance differs across ML applications. While relaxing coherence can reduce data-communication overhead, its complicated impact on the progress of learning has not been well studied, leaving domain experts and modern systems without clear guidance. We define the stale-value tolerance of ML training in terms of the effective learning rate, which can be characterized by the implicit momentum hyperparameter, the local update density, the choice of activation function, RNN cell types, and learning-rate adaptation. The findings of this work open further exploration of asynchronous learning, including extensions of the results laid out in this dissertation.


Additional Information

In Campus Calendar

Graduate Studies

Invited Audience
PhD Defense
  • Created By: Tatianna Richardson
  • Workflow Status: Published
  • Created On: Nov 22, 2016 - 9:33am
  • Last Updated: Nov 22, 2016 - 9:33am