<![CDATA[Ph.D. Dissertation Defense

618980 event 1551997583 1551997583 <![CDATA[Ph.D. Dissertation Defense - Hardik Sharma]]> Title: Accelerate Deep Learning for the Edge-to-cloud Continuum: A Specialized Full Stack Derived from Algorithms

Committee:

Dr. Hadi Esmaeilzadeh, ECE, Chair, Advisor

Dr. Hyesoon Kim, CoC

Dr. Milos Prvulovic, CoC

Dr. Tushar Krishna, ECE

Dr. Vikas Chandra, Facebook

Abstract:

Advances in high-performance computer architecture design have been a major driver of the rapid evolution of Deep Neural Networks (DNNs). Because of their insatiable demand for compute power, both the research community and industry have naturally turned to accelerators to accommodate modern DNN computation. Furthermore, DNNs are gaining prevalence and have found applications across a wide spectrum of devices, from commodity smartphones to enterprise cloud platforms. However, there is no one-size-fits-all solution for this continuum of devices that can meet both the strict energy, power, and chip-area budgets of edge devices and the high performance requirements of enterprise-grade servers. This thesis designs a specialized compute stack for DNN acceleration across the edge-to-cloud continuum that flexibly matches the varying constraints of different devices and simultaneously exploits algorithmic properties to maximize the benefits of acceleration. To this end, this thesis first explores a tight integration of Neural Network (NN) accelerators within massively parallel GPUs with minimal area overhead. We show that tight coupling of NN accelerators and GPUs can provide a significant gain in performance and energy efficiency across a diverse set of applications through neural acceleration, that is, by approximating regions of approximation-amenable code using neural networks. Next, this thesis develops a full stack for accelerating DNN inference on FPGAs that encompasses (1) high-level algorithmic abstractions, (2) a flexible template accelerator architecture, and (3) a compiler that automatically and efficiently optimizes the template architecture to maximize DNN performance using the limited resources available on the FPGA die. Next, this thesis explores scale-out acceleration of training using cloud-scale FPGAs for a wide range of machine learning algorithms, including neural networks. 
The challenge here is to design an accelerator architecture that can scale up to efficiently use the large pool of compute resources available on modern cloud-grade FPGAs. To tackle this challenge, this thesis explores multi-threading to maximize the efficiency of FPGA acceleration by running multiple parallel threads of training. Then, this thesis builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, the bitwidth that prevents loss of accuracy varies significantly across DNNs, and it may even be adjusted for each layer individually. To accommodate this variation, this thesis introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. Finally, this thesis explores mixed-signal acceleration to push accelerator efficiency to its limits. While mixed-signal circuitry promises significant efficiency benefits, it suffers from a limited range for information encoding, susceptibility to noise, and Analog-to-Digital (A/D) conversion overheads. This thesis addresses these challenges by leveraging the insight that a vector dot-product (the basic operation in DNNs) can be bit-partitioned into groups of spatially parallel low-bitwidth operations and interleaved across multiple elements of the vectors. As such, the building block of our accelerator becomes a group of wide, yet low-bitwidth multiply-accumulate units that operate in the analog domain and share a single A/D converter. Using this bit-partitioned building block, we design a 3D-stacked accelerator architecture that can provide significant gains in efficiency over a state-of-the-art purely digital 3D-stacked accelerator, without losing any classification accuracy.
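
The bit-partitioning idea in the abstract can be illustrated with a small software sketch. This is not the accelerator design from the thesis; it only shows the arithmetic identity that a dot product over wide operands equals a shifted sum of dot products over narrow bit slices. The function names (`bit_slices`, `bit_partitioned_dot`), the unsigned 8-bit operands, and the 2-bit slice width are all assumptions chosen for illustration.

```python
def bit_slices(values, total_bits, slice_bits):
    """Split each unsigned integer into little-endian slices of slice_bits bits.

    Returns one list per slice position; each list spans all vector elements.
    (Hypothetical helper, not taken from the thesis.)
    """
    if total_bits % slice_bits != 0:
        raise ValueError("total_bits must be a multiple of slice_bits")
    if any(v < 0 or v >= (1 << total_bits) for v in values):
        raise ValueError("values must be unsigned and fit in total_bits")
    mask = (1 << slice_bits) - 1
    n_slices = total_bits // slice_bits
    return [[(v >> (k * slice_bits)) & mask for v in values]
            for k in range(n_slices)]


def bit_partitioned_dot(x, w, total_bits=8, slice_bits=2):
    """Dot product computed from low-bitwidth partial dot products.

    Each (i, j) pair below is a wide but low-bitwidth multiply-accumulate
    across all vector elements; the shift restores the weight of the slices.
    """
    if len(x) != len(w):
        raise ValueError("vectors must have the same length")
    x_slices = bit_slices(x, total_bits, slice_bits)
    w_slices = bit_slices(w, total_bits, slice_bits)
    acc = 0
    for i, xs in enumerate(x_slices):
        for j, ws in enumerate(w_slices):
            partial = sum(a * b for a, b in zip(xs, ws))
            acc += partial << (slice_bits * (i + j))
    return acc


if __name__ == "__main__":
    x = [13, 200, 7, 255]
    w = [3, 91, 128, 255]
    reference = sum(a * b for a, b in zip(x, w))
    print(bit_partitioned_dot(x, w) == reference)
```

In this sketch every partial sum involves only 2-bit operands, which is the property that would let such a group of operations share one conversion step in a mixed-signal design; signed operands and noise are deliberately left out.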

]]> <![CDATA[]]> 434381 1788 100811 1808