PhD Defense by William Jonghoon Won
Title: Software-Hardware Optimizations for Efficient Collective Communications in Distributed Machine Learning Platforms
Date: Tuesday, November 25, 2025
Time: 4:00 PM - 6:00 PM ET
Location: Virtual (https://gatech.zoom.us/j/99350294863?pwd=4Gdm6gNuHmQpb78ojyAbgnVITeqpgg.1)
William Jonghoon Won
Ph.D. Candidate
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Dr. Tushar Krishna (advisor) - School of Electrical and Computer Engineering & School of Computer Science, Georgia Institute of Technology
Dr. Yingyan (Celine) Lin - School of Computer Science, Georgia Institute of Technology
Dr. Divya Mahajan - School of Computer Science & School of Electrical and Computer Engineering, Georgia Institute of Technology
Dr. Manya Ghobadi - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Dr. Bradford Beckmann - Research and Advanced Development, Advanced Micro Devices
Abstract:
Foundation machine learning (ML) models, exemplified by mixture-of-experts (MoE)-based large language models (LLMs), have emerged as some of the most prominent workloads in modern computing. The immense resource demands of these models have driven the development of large-scale high-performance computing (HPC) platforms tailored for artificial intelligence (AI) workloads. In such distributed platforms, both model parameters and data are partitioned and processed across numerous neural processing units (NPUs), requiring frequent synchronization of activations and gradients through collective communication operations. As collective communication constitutes a primary bottleneck in distributed ML, optimizing its efficiency remains a critical research challenge.
This dissertation explores software-hardware optimizations for collective communications to better understand the tightly coupled networking design space of distributed ML platforms. First, it introduces ASTRA-sim2.0, an end-to-end simulation and modeling framework that enables comprehensive design space exploration of distributed ML systems with arbitrary parallelization strategies and multi-dimensional network fabrics. Second, it presents LIBRA, which enhances the bandwidth utilization of hierarchical collective communication algorithms by optimizing multi-dimensional network topologies via analytical modeling. Finally, it proposes TACOS and PCCL, collective communication algorithm synthesizers that automatically generate optimized collective communication algorithms for arbitrary network topologies. Together, these contributions underscore the significance of judicious software-hardware co-design in achieving efficient collective communication for large-scale distributed ML platforms.