<node id="682722">
  <nid>682722</nid>
  <type>event</type>
  <uid>
    <user id="28475"><![CDATA[28475]]></user>
  </uid>
  <created>1749246798</created>
  <changed>1749246868</changed>
  <title><![CDATA[Ph.D. Dissertation Defense - Woohong Byun]]></title>
  <body><![CDATA[<p><strong>Title:</strong> <em>Energy-Efficient Hardware Acceleration of Transformer-Based Models</em></p><p><strong>Committee:</strong></p><p>Dr. Saibal Mukhopadhyay, ECE, Chair, Advisor</p><p>Dr. Shimeng Yu, ECE</p><p>Dr. Visvesh Sathe, ECE</p><p>Dr. Callie Hao, ECE</p><p>Dr. Hyesoon Kim, CoC</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Energy-Efficient Hardware Acceleration of Transformer-Based Models]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>The objective of this research is to develop a software-hardware co-optimization framework for the energy-efficient deployment of transformer-based language models, such as BERT and generative LLMs, on resource-constrained platforms like FPGAs. This work addresses memory and computation challenges through novel quantization algorithms and custom accelerator designs. For BERT, a Hessian-based parameter-wise mixed-precision quantization method is proposed, assigning optimal precision to each parameter based on second-order sensitivity. To enhance hardware efficiency, a Hessian-driven row-wise weight quantization scheme is introduced, enabling mixed-precision matrices to be separated into two uniform-precision matrices so that all parameters fit on-chip with the proposed FPGA accelerator. For generative LLMs, where memory demands scale with sequence length, a Weight-Hessian-aware KV cache quantization strategy is presented, applying intra-layer mixed precision using precomputed Hessians to eliminate runtime overhead. To further reduce hardware complexity, a Query-Key coupled activation quantization method aligns the bit precision of outer-product pairs through Query-Key coupled Hessian analysis. A concurrent quantization approach jointly optimizes row-wise weight and Query-Key activation precision using multi-precision formats, improving compression and energy efficiency. These techniques are supported by a novel multi-precision FPGA accelerator for BERT and GPT-2, capable of handling both power-of-two and non-power-of-two bit widths. With optimized dataflow, the design minimizes off-chip memory access and significantly outperforms existing solutions in energy efficiency and inference performance.</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2025-06-16T11:00:00-04:00]]></value>
      <value2><![CDATA[2025-06-16T13:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
  </field_extras>
  <field_audience>
    <item>
      <value><![CDATA[Public]]></value>
    </item>
  </field_audience>
  <field_media>
  </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Online]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
      <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
    <item>
      <url><![CDATA[https://teams.microsoft.com/l/meetup-join/19%3ameeting_OTM3MWZjZmMtY2UxMS00MzBkLWFiYTgtOWE2MjhiMDdhMjlj%40thread.v2/0?context=%7b%22Tid%22%3a%22482198bb-ae7b-4b25-8b7a-6d7f32faa083%22%2c%22Oid%22%3a%224f74ada8-7c29-4bba-a4ad-2cf7214f2aa0%22%7d]]></url>
      <link_title><![CDATA[Microsoft Teams Meeting link]]></link_title>
    </item>
  </links_related>
  <files>
  </files>
  <og_groups>
    <item>434381</item>
  </og_groups>
  <og_groups_both>
    <item><![CDATA[ECE Ph.D. Dissertation Defenses]]></item>
  </og_groups_both>
  <field_categories>
    <item>
      <tid>1788</tid>
      <value><![CDATA[Other/Miscellaneous]]></value>
    </item>
  </field_categories>
  <field_keywords>
    <item>
      <tid>100811</tid>
      <value><![CDATA[Phd Defense]]></value>
    </item>
    <item>
      <tid>1808</tid>
      <value><![CDATA[graduate students]]></value>
    </item>
  </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
