{"682722":{"#nid":"682722","#data":{"type":"event","title":"Ph.D. Dissertation Defense - Woohong Byun","body":[{"value":"\u003Cp\u003E\u003Cstrong\u003ETitle\u003C\/strong\u003E\u003Cem\u003E:\u0026nbsp; Energy-Efficient Hardware Acceleration of Transformer-Based Models\u003C\/em\u003E\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003ECommittee:\u003C\/strong\u003E\u003C\/p\u003E\u003Cp\u003EDr. Saibal Mukhopadhyay, ECE, Chair, Advisor\u003C\/p\u003E\u003Cp\u003EDr. Shimeng Yu, ECE\u003C\/p\u003E\u003Cp\u003EDr. Visvesh Sathe, ECE\u003C\/p\u003E\u003Cp\u003EDr. Callie Hao, ECE\u003C\/p\u003E\u003Cp\u003EDr. Hyesoon Kim, CoC\u003C\/p\u003E","summary":"","format":"limited_html"}],"field_subtitle":"","field_summary":[{"value":"\u003Cp\u003EThe objective of this research is to develop a software-hardware co-optimization framework for energy-efficient deployment of transformer-based language models, such as BERT and generative LLMs, on resource-constrained platforms such as FPGAs. This work addresses memory and computation challenges through novel quantization algorithms and custom accelerator designs. For BERT, a Hessian-based parameter-wise mixed-precision quantization method is proposed, assigning optimal precision to each parameter based on second-order sensitivity. To enhance hardware efficiency, a Hessian-driven row-wise weight quantization scheme is introduced, enabling mixed-precision matrices to be separated into two uniform-precision matrices, allowing all parameters to fit on-chip with the proposed FPGA accelerator. For generative LLMs, where memory demands scale with sequence length, a Weight-Hessian-aware KV cache quantization strategy is presented, applying intra-layer mixed-precision using precomputed Hessians to eliminate runtime overhead. To further reduce hardware complexity, a Query-Key coupled activation quantization method aligns bit precision of outer product pairs through Query-Key coupled Hessian analysis. A concurrent quantization approach jointly optimizes row-wise weight and Query-Key activation precision using multi-precision formats, improving compression and energy efficiency. These techniques are supported by a novel multi-precision FPGA accelerator for BERT and GPT-2, capable of handling both power-of-two and non-power-of-two bit-widths. With optimized dataflow, the design minimizes off-chip memory access and significantly outperforms existing solutions in energy efficiency and inference performance.\u003C\/p\u003E","format":"limited_html"}],"field_summary_sentence":[{"value":"Energy-Efficient Hardware Acceleration of Transformer-Based Models "}],"uid":"28475","created_gmt":"2025-06-06 21:53:18","changed_gmt":"2025-06-06 21:54:28","author":"Daniela Staiculescu","boilerplate_text":"","field_publication":"","field_article_url":"","field_event_time":{"event_time_start":"2025-06-16T11:00:00-04:00","event_time_end":"2025-06-16T13:00:00-04:00","event_time_end_last":"2025-06-16T13:00:00-04:00","gmt_time_start":"2025-06-16 15:00:00","gmt_time_end":"2025-06-16 17:00:00","gmt_time_end_last":"2025-06-16 17:00:00","rrule":null,"timezone":"America\/New_York"},"location":"Online","extras":[],"related_links":[{"url":"https:\/\/teams.microsoft.com\/l\/meetup-join\/19%3ameeting_OTM3MWZjZmMtY2UxMS00MzBkLWFiYTgtOWE2MjhiMDdhMjlj%40thread.v2\/0?context=%7b%22Tid%22%3a%22482198bb-ae7b-4b25-8b7a-6d7f32faa083%22%2c%22Oid%22%3a%224f74ada8-7c29-4bba-a4ad-2cf7214f2aa0%22%7d","title":"Microsoft Teams Meeting link"}],"groups":[{"id":"434381","name":"ECE Ph.D. Dissertation Defenses"}],"categories":[],"keywords":[{"id":"100811","name":"Phd Defense"},{"id":"1808","name":"graduate students"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[{"id":"1788","name":"Other\/Miscellaneous"}],"invited_audience":[{"id":"78771","name":"Public"}],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[],"email":[],"slides":[],"orientation":[],"userdata":""}}}