Ph.D. Defense by Nagesh Bangalore Lakshminarayana

Title: Efficient Graph Algorithm Execution on Data-Parallel Architectures

Nagesh Bangalore Lakshminarayana
Ph.D. Candidate
School of Computer Science
College of Computing

Date: October 30, 2014 (Thursday)
Time: 11 AM ET
Location: KACB 2100

Committee:

Dr. Hyesoon Kim, School of Computer Science (Advisor)
Dr. Santosh Pande, School of Computer Science
Dr. Milos Prvulovic, School of Computer Science
Dr. Moinuddin K. Qureshi, School of Electrical and Computer Engineering
Dr. Richard Vuduc, School of Computational Science and Engineering


Graph algorithms are an important component of applications from several domains, such as social computing, bioinformatics, and communication networks. To improve the throughput of processing large graphs, parallel versions of several graph algorithms have been designed. Data-parallel architectures such as GPUs, with their high execution throughput and high memory bandwidth, are a good candidate platform for executing such parallel graph algorithms.

However, GPU architectures are not a natural fit for graph algorithms. Due to their irregular control flow, non-uniform work distribution, and data-dependent memory accesses, graph algorithms execute with low efficiency on GPUs. The large number of data-dependent memory accesses often results in a significant number of idle cycles. Considerable work has been done on reducing the negative impact of control flow divergence in GPGPU applications, and the same approaches can be applied to the execution of graph algorithms as well. The other problems, non-uniform work distribution and the inability to fully tolerate memory access latency, are much less explored and thus provide opportunities for innovation.
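
The irregularity above can be seen in a level-synchronous breadth-first search over a graph in CSR (compressed sparse row) form, the layout commonly parallelized on GPUs. The sketch below is a minimal illustrative example (the graph data is hypothetical, not from the dissertation): each neighbor lookup is indexed by values loaded just before it, so the memory addresses are data-dependent, and the per-vertex work varies with degree.

```python
from collections import deque

# Hypothetical toy graph in CSR form. Neighbors of vertex v are
# col_indices[row_offsets[v] : row_offsets[v + 1]], so every neighbor
# load is a data-dependent access: the address depends on loaded offsets.
row_offsets = [0, 2, 4, 5, 6]       # 4 vertices
col_indices = [1, 2, 0, 3, 3, 1]    # edge targets

def bfs_levels(src):
    """Level-synchronous BFS, the pattern GPU implementations parallelize."""
    level = {src: 0}
    frontier = deque([src])
    while frontier:
        v = frontier.popleft()
        # Non-uniform work: the neighbor count varies per vertex.
        for u in col_indices[row_offsets[v]:row_offsets[v + 1]]:
            if u not in level:
                level[u] = level[v] + 1
                frontier.append(u)
    return level
```

On a GPU, threads mapped to vertices of widely differing degree cause exactly the work imbalance and divergent, latency-bound loads the abstract describes.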

In this dissertation I will present a graph-algorithm-specific prefetching mechanism and an evaluation of different design options for GPU cache hierarchies to improve the tolerance of graph algorithms to memory access latency. Lastly, I will examine the impact of graph algorithm implementation decisions.

The prefetching mechanism stores prefetched data in spare registers in addition to the cache. Due to resource allocation constraints, GPUs often have several unallocated registers during kernel execution. By using these unallocated registers, the mechanism prevents prefetched data from being lost to cache evictions and makes prefetching more effective.
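
The benefit of keeping a register-side copy can be sketched with a toy model (not the dissertation's actual hardware design): a small direct-mapped cache where an intervening access can evict a prefetched line before its demand use, unless a copy survives in a "spare register" store. All names and parameters here are hypothetical, chosen only to make the eviction effect visible.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache of whole addresses (toy model)."""
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.tags = [None] * num_sets

    def access(self, addr):
        """Return True on hit; always install the address afterwards."""
        s = addr % self.num_sets
        hit = self.tags[s] == addr
        self.tags[s] = addr
        return hit

def run(addresses, prefetches, use_spare_registers):
    """Count demand hits when prefetched lines may be evicted before use.
    With use_spare_registers, a copy of each prefetch also lands in a
    register-side store that cache evictions cannot touch."""
    cache = DirectMappedCache(4)
    regs = set()
    for p in prefetches:
        cache.access(p)               # prefetch fills the cache...
        if use_spare_registers:
            regs.add(p)               # ...and a copy fills a spare register
    hits = 0
    for a in addresses:
        if a in regs or cache.access(a):
            hits += 1
    return hits
```

In this model, addresses 8 and 12 conflict in the same set: a prefetch of 8 followed by a demand access to 12 evicts the prefetched line, so the later use of 8 misses unless the register copy exists.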

Unlike traditional streaming GPGPU applications, graph algorithms exhibit some data locality, so the small GPGPU caches can play an effective role in improving memory access latency tolerance. First, we evaluate how the cache inclusion property affects the performance of graph algorithms. Besides a non-inclusive cache hierarchy, we consider an exclusive cache hierarchy and other cache designs that apply exclusion selectively, including a new memory-region-based cache exclusion scheme. Second, we examine the impact of different cache bypass mechanisms on the performance of graph algorithms. Third, we examine the impact of fine-grained cache hierarchies that dynamically decide whether to fetch a full cache block or one or more cache sub-blocks. The evaluated schemes include a new PC-based mechanism for predicting the sub-blocks that are likely to be accessed in the future.
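
The idea behind a PC-based sub-block predictor can be sketched as follows. This is a minimal, hypothetical model (policy and table structure are assumptions, not the dissertation's design): a table indexed by the load instruction's PC remembers which sub-blocks past misses from that PC went on to touch, and predicts the same usage mask for the next miss from that PC.

```python
SUBBLOCKS = 4  # assumed sub-blocks per cache block

class SubblockPredictor:
    """PC-indexed predictor of which sub-blocks to fetch on a miss (toy model)."""
    def __init__(self):
        self.table = {}  # pc -> bitmask of sub-blocks predicted useful

    def predict(self, pc):
        # Conservative default: fetch the full block for an unseen PC.
        return self.table.get(pc, (1 << SUBBLOCKS) - 1)

    def train(self, pc, used_mask):
        # Simplest possible policy: remember the last observed usage mask.
        self.table[pc] = used_mask
```

A correct prediction fetches only the sub-blocks that will actually be used, saving memory bandwidth; a mispredicted sub-block simply costs an extra fetch later.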

Finally, we examine the impact of different implementation strategies on graph algorithm performance, power, and energy consumption, as well as the impact of software optimizations such as prefetching and loop unrolling.
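
As one concrete example of the kind of software optimization mentioned, loop unrolling of a neighbor-gather loop can be sketched as below (a generic illustration, not an implementation from the dissertation): the unrolled body issues several independent loads per iteration, reducing branch overhead and exposing more memory-level parallelism.

```python
def sum_neighbors(col_indices, start, end):
    """Sum a vertex's neighbor IDs over col_indices[start:end],
    with the main loop unrolled by a factor of 4."""
    total = 0
    i = start
    # Unrolled by 4: one loop test per four elements, and four
    # independent accesses whose latencies can overlap.
    while i + 4 <= end:
        total += (col_indices[i] + col_indices[i + 1]
                  + col_indices[i + 2] + col_indices[i + 3])
        i += 4
    while i < end:  # remainder loop for the leftover elements
        total += col_indices[i]
        i += 1
    return total
```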

