PhD Defense by William Jonghoon Won
Title: Software-Hardware Optimizations for Efficient Collective Communications in Distributed Machine Learning Platforms
Date: Tuesday, November 25, 2025
Time: 4:00 PM - 6:00 PM ET
Location: Virtual (https://gatech.zoom.us/j/99350294863?pwd=4Gdm6gNuHmQpb78ojyAbgnVITeqpgg.1)
William Jonghoon Won
Ph.D. Candidate
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Dr. Tushar Krishna (advisor) - School of Electrical and Computer Engineering & School of Computer Science, Georgia Institute of Technology
Dr. Yingyan (Celine) Lin - School of Computer Science, Georgia Institute of Technology
Dr. Divya Mahajan - School of Computer Science & School of Electrical and Computer Engineering, Georgia Institute of Technology
Dr. Manya Ghobadi - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Dr. Bradford Beckmann - Research and Advanced Development, Advanced Micro Devices
Abstract:
Foundation machine learning (ML) models, exemplified by mixture-of-experts (MoE)-based large language models (LLMs), have emerged as some of the most prominent workloads in modern computing. The immense resource demands of these models have driven the development of large-scale high-performance computing (HPC) platforms tailored for artificial intelligence (AI) workloads. In such distributed platforms, both model parameters and data are partitioned and processed across numerous neural processing units (NPUs), requiring frequent synchronization of activations and gradients through collective communication operations. As collective communication constitutes a primary bottleneck in distributed ML, optimizing its efficiency remains a critical research challenge.
This dissertation explores software-hardware optimizations for collective communications to better understand the tightly coupled networking design space of distributed ML platforms. First, it introduces ASTRA-sim2.0, an end-to-end simulation and modeling framework that enables comprehensive design space exploration of distributed ML systems with arbitrary parallelization strategies and multi-dimensional network fabrics. Second, it presents LIBRA, which enhances the bandwidth utilization of hierarchical collective communication algorithms by optimizing multi-dimensional network topologies via analytical modeling. Finally, it proposes TACOS and PCCL, collective communication algorithm synthesizers that automatically generate optimized collective communication algorithms for arbitrary network topologies. Together, these contributions underscore the significance of judicious software-hardware co-design in achieving efficient collective communication for large-scale distributed ML platforms.