event

PhD Proposal by Girish Mururu

Primary tabs

Title: Compiler Guided Scheduling : A Cross-stack Approach for Performance Elicitation

 

Girish Mururu

Ph.D. Student in Computer Science

School of Computer Science 

College of Computing

Georgia Institute of Technology

 

Date: Tuesday, November 12, 2019

Time: 1:30 - 3:30 pm (EST)

Location: KACB 2100

 

 

Committee :

-------------------

Dr. Santosh Pande (Advisor, School of Computer Science, Georgia Institute of Technology)

Dr. Ada Gavrilovska (School of Computer Science, Georgia Institute of Technology)

Dr. Kishore Ramachandran (School of Computer Science, Georgia Institute of Technology)

Dr. Vivek Sarkar (School of Computer Science,  Georgia Institute of Technology)

 

Abstract :

-------------

Modern software executes on multi-core systems that share resources such as

several levels of memory hierarchy (caches, main memory as well as persistent storage),

as well as I/O devices. In such a co-execution environment, the performance of modern

software is critically affected due to the resource conflicts resulting due to the

sharing of resources. Compiler optimizations have traditionally focused on analyzing

and optimizing the performance of individual (single or multi-threaded) applications

and have resulted in tremendous strides in this regard. On the other hand, schedulers

have dealt with the problem of resource sharing mostly adopting fairness as the primary

criterion in terms of single core sharing while adopting load balancing and processor affinity

as the criteria in terms of multi-core scheduling. Both the compiler and the scheduler stacks have

continued to evolve stand-alone; this thesis claims that there is a significant opportunity

to improve performance (execution speed of individual applications), resource sharing and throughput

by developing a synergistic approach that combines compiler generated dynamic application attributes

with runtime aggregation of such information to undertake smart scheduling decisions.

One of the key goals of this work is to perform such a cross-stack approach  by utilizing the

existing systems interfaces without introducing new ones. Another important characteristic is that the

approach is built with a layering solution without modifying the OS. Such a design is envisioned not to

perturb other systems properties which have evolved over a period of time.  

 

Modern workloads related to computer vision, media computation and machine learning exhibit a very

high amount of data locality. Although modern OS deploys processor affinity to induce data

locality aware scheduling, lack of knowledge of precise dynamic application characteristics leaves a significant

performance inefficiency on the table due to a significant number of process migrations carried

out by the scheduler. In our first work, PinIt, we decrease unwanted migrations  of processes among

cores by only influencing the scheduler without modifying it. In order to fairly and efficiently

utilize cores, schedulers such as  CFS migrate threads between cores during execution. Although such

thread migrations alleviate the problem of stalling and yield better core utilization, they can

also destroy data locality, resulting in  fewer cache hits, TLB hits and thus performance loss for

the  group of applications collectively.  

PinIt first determines the regions of a program in which the process should be pinned onto a core

so that adverse migrations causing excessive cache and TLB misses are avoided by calculating memory

reuse density, a  new measure that quantifies the reuses within code regions. Pin/unpin calls are

then hoisted at the entry and exits of the region. The migration of the processes is prevented within

the pinned regions. The thesis presents new analyses and transformations that optimize the placement

of such calls. In an overloaded environment compared to priority-cfs, PinIt speeds up high-priority

applications in mediabench workloads by 1.16x and 2.12x  and in computer vision-based workloads by

1.35x and 1.23x on 8cores and 16cores, respectively, with almost the same or better throughput for

low-priority applications.

 

In the second work, to achieve  very high throughput for large number of batch jobs, we develop a

throughput-oriented scheduler that processes compiler inserted beacon that transmit applications' dynamic

information to the scheduler at runtime. Typically, schedulers conservatively co-locate processes to avoid

cache conflicts since miss penalties are quite heavy leading to lower resource utilization

((ranging from 50 to 70%). Moreover in a throughput oriented setting, such a conservative scheduling

leads to significant losses in terms of achieved throughput. Our approach relies on the compiler to

insert ``beacons'' in the application at strategic program points to periodically produce and/or

update details of anticipated resource-heavy program region(s). The compiler classifies loops in

programs based on cache usage and predictability of their execution time and inserts different types

of beacons at their entry/exit points. The precision  of the information carried by beacons varies as

per the analyzability of the loops, and the scheduler uses performance counters at runtime to fine

tune decision making for concurrency. The information produced by beacons in multiple processes is aggregated

and analyzed by the predictive scheduler to proactively respond to the anticipated workload requirements.

A framework prototype demonstrates high-quality predictions and improvements in throughput over CFS by

up to 4.7x on ThunderX and up to 5.2x on ThunderX2 servers for consolidated workloads.

 

Finally, as a part of proposed work,  we devise extensions to the beacon scheduler to target

low latency environment and also efficiently handle multi-threaded processes. To achieve low latency

while efficiently using resources, schedulers  must divide the available processes among the available

cores such that latency is the lowest for each process. In case of multi-threaded processes, the compiler

and the runtime  must send beacons for each thread and the scheduler must manage resources efficiently

among all the inter-process threads. We also plan to augment the beacon analysis with  path and

call chain prediction to be able to predict the entire workload of an application at the start and

then send corrective beacons for mispredictions for a complete predictive scheduler.

 

Status

  • Workflow Status:Published
  • Created By:Tatianna Richardson
  • Created:11/06/2019
  • Modified By:Tatianna Richardson
  • Modified:11/06/2019

Categories

Keywords