PhD Defense by Girish Mururu
Title: Compiler Guided Scheduling: A Cross-stack Approach for Performance Elicitation
School of Computer Science
College of Computing
Georgia Institute of Technology
Date: Monday, August 10th, 2020
Time: 3:00 PM – 5:00 PM ET
Location (virtual): https://bluejeans.com/808029869
**Note: This defense is remote-only due to the institute's guidelines on COVID-19**
Committee:
Dr. Santosh Pande (Advisor), School of Computer Science, Georgia Institute of Technology
Dr. Ada Gavrilovska, School of Computer Science, Georgia Institute of Technology
Dr. Kishore Ramachandran, School of Computer Science, Georgia Institute of Technology
Dr. Vivek Sarkar, School of Computer Science, Georgia Institute of Technology
Dr. Tushar Krishna, School of Electrical and Computer Engineering, Georgia Institute of Technology
Abstract:
Modern software executes on multi/many-core systems that share resources such as several levels of the memory hierarchy (caches, main memory, secondary storage), I/O, and network interfaces. In such a co-execution environment, the performance of modern software is critically affected by resource conflicts that arise from this sharing. Resource requirements vary not only across processes but also during the execution of a given process. Current resource management techniques involving OS schedulers have evolved from, and mainly rely on, the principles of fairness (achieved through time-multiplexing) and load-balancing, and are oblivious to the dynamic resource requirements of individual processes. Compiler research, on the other hand, has traditionally centered on optimizing single- and multi-threaded programs limited to one process, yet compilers can analyze the resource requirements of a process. This thesis contends that significant performance enhancement can be achieved through compiler guidance of schedulers in terms of dynamic program characteristics and resource needs.
Towards compiler guided scheduling, we first look at the problem of process migration. For load-balancing purposes, OS schedulers such as CFS can migrate threads while they are in the middle of an intense memory reuse region, thus destroying warmed-up caches and TLBs. To solve this problem while providing enough flexibility for load-balancing, we propose PinIt, which first determines the regions of a program in which the process should be pinned onto a core so that adverse migrations causing excessive cache and TLB misses are avoided. The thesis proposes new measures, such as unique memory reuse and memory reuse density, that capture the performance penalties incurred due to migration. The compiler analysis determines minimal program regions to be pinned, and pin/unpin calls are then hoisted to the entry and exits of each region so that migrations are prevented within pinned regions. In an overloaded environment, compared to priority-cfs, PinIt speeds up high-priority applications in mediabench workloads by 1.16x and 2.12x and in computer vision-based workloads by 1.35x and 1.23x on 8 cores and 16 cores, respectively, with almost the same or better throughput for low-priority applications.
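As a rough illustration of what a pin/unpin pair around a memory-reuse region could look like, the sketch below uses the Linux CPU affinity interface exposed by Python's `os` module. It is not PinIt's implementation: the names `pinned_region` and `reuse_heavy_kernel` are hypothetical, the region is chosen by hand rather than by compiler analysis, and the core is picked arbitrarily from the allowed set rather than being the core the process is currently running on.

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_region():
    """Pin the calling process to a single core for the duration of the block.

    Hypothetical helper for illustration only; requires Linux.
    """
    saved = os.sched_getaffinity(0)      # cores the process may currently run on
    core = min(saved)                    # arbitrary pick for this sketch
    os.sched_setaffinity(0, {core})      # "pin" at region entry
    try:
        yield core
    finally:
        os.sched_setaffinity(0, saved)   # "unpin" at region exit, restoring the mask


def reuse_heavy_kernel(n=100_000):
    """Stand-in for a region that repeatedly touches the same small buffer."""
    data = list(range(1024))
    total = 0
    for i in range(n):
        total += data[i % 1024]
    return total


before = os.sched_getaffinity(0)
with pinned_region() as core:
    during = os.sched_getaffinity(0)     # only the chosen core is allowed here
    result = reuse_heavy_kernel()
after = os.sched_getaffinity(0)          # original mask is back in place
```

Restoring the saved mask on exit, rather than leaving the process pinned, reflects the abstract's point that the scheduler should keep its load-balancing flexibility outside the pinned regions.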
The critical problem of co-locating processes that share resources must be solved for efficiency in a co-execution environment. Several approaches proposed in the literature rely on static profile data or dynamic performance-counter-based information, and they exhibit significant suboptimality due to their inherent inability to use dynamic information in an anticipatory (proactive) manner. This thesis proposes Beacons, a generic compiler guided framework that instruments programs with generated models or equations of specific program characteristics and provides a runtime counterpart that delivers the dynamically generated information to the scheduler. In the thesis, a new timing analysis is developed with an accuracy of 84%, and the beacons relay cache footprint and locality classification information to the schedulers. The thesis presents two schedulers that leverage these beacons: the Beacon Enabled Scheduler (BES), which targets the problem of co-scheduling to maximize throughput, and Bellator, which targets the problem of co-location to minimize latency with fairness. A prototype of BES improves throughput over the default Linux scheduler (CFS) by up to 4.7x on ThunderX and up to 5.2x on ThunderX2 servers for consolidated workloads. A prototype of Bellator on ThunderX2 with 224 hardware threads lowers 100th-percentile latency by 14% on average when executing 108 and 162 simultaneous processes, and by 3% on average for 54 and 216 simultaneous processes.
The thesis also provides a preview of how beacons carrying cache-miss information, modeled similarly to the timing analysis, can enable secure co-location of processes in multi-tenant environments by detecting and mitigating cache-based side-channel attacks. Our beacon-based scheduler solution detects and mitigates attacks that use the well-known cache-based side-channel techniques (Prime+Probe, Flush+Reload, and Flush+Flush) against OpenSSL cryptography algorithms in multi-tenant environments.