<node id="685253">
  <nid>685253</nid>
  <type>event</type>
  <uid>
    <user id="28475"><![CDATA[28475]]></user>
  </uid>
  <created>1758748056</created>
  <changed>1758748182</changed>
  <title><![CDATA[Ph.D. Proposal Oral Exam - Abhimanyu Bambhaniya]]></title>
  <body><![CDATA[<p><strong>Title:</strong> <em>Algorithmic Optimizations and Distributed Platform Design for LLM Inference</em></p><p><strong>Committee:</strong></p><p>Dr. Krishna, Advisor</p><p>Dr. Tumanov, Chair</p><p>Dr. Raychowdhury</p><p>Dr. Subramanian</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Algorithmic Optimizations and Distributed Platform Design for LLM Inference]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>The objective of the proposed research is to develop systematic approaches to next-generation AI inference platform design that address the complex optimization challenges spanning the multiple interdependent layers of modern Large Language Model (LLM) serving infrastructure. LLM inference platforms have become critical for deploying AI at scale, yet their design involves navigating an enormous space of architectural choices, from evolving model architectures and optimization techniques to heterogeneous hardware platforms. Current approaches rely heavily on empirical testing of a limited set of configurations and lack comprehensive frameworks for evaluating the interplay between workload characteristics, system optimizations, and hardware architectures. This disconnect between algorithmic innovations and hardware capabilities results in suboptimal deployments, inefficient resource utilization, and missed opportunities for performance improvement across the increasingly complex multi-stage inference pipelines that characterize real-world AI applications. This thesis proposes a systematic approach to AI inference platform design through comprehensive modeling and simulation frameworks. GenZ, an analytical framework, enables rapid exploration of LLM architectures and optimizations across diverse current and next-generation NPU platforms, providing insights for future hardware design. MIST extends this capability to full end-to-end inference pipelines, modeling complex multi-client scenarios with heterogeneous batching strategies and hardware configurations that existing simulators cannot capture. Additionally, the thesis studies hardware-software co-design for sparse attention accelerators, demonstrating systematic methodologies for architectural optimization under sparsity constraints. Finally, the thesis proposes training recipes for structured N:M sparsity in transformer models, achieving up to 5% accuracy improvements in high-sparsity regimes through progressive gradient flow techniques, and outlines MoE optimization strategies for multi-host inference as well as next-generation platforms for agentic AI workloads.</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2025-09-30T12:00:00-04:00]]></value>
      <value2><![CDATA[2025-09-30T14:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Room 3126, Klaus]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
          <item>
        <url>https://gatech.zoom.us/j/97053430439?pwd=gUNSlCe3phYeJMuc0FmOKvRc1ZBXAw.1</url>
        <link_title><![CDATA[Zoom link]]></link_title>
      </item>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>434371</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[ECE Ph.D. Proposal Oral Exams]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>102851</tid>
        <value><![CDATA[Phd proposal]]></value>
      </item>
          <item>
        <tid>1808</tid>
        <value><![CDATA[graduate students]]></value>
      </item>
      </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
