PhD Defense by Mikhail Isaev

Title: Methodologies for co-designing supercomputer-scale systems and deep learning software

 

Mikhail Isaev

Computer Science PhD Student

School of Computational Science and Engineering

Georgia Institute of Technology

 

Date: Tuesday, Feb 20, 2024

Time: 13:00 – 15:00 EST

Location: C1015 (Vinings) and Teams

 

Committee:

Dr. Richard W. Vuduc (advisor), School of Computational Science and Engineering, Georgia Institute of Technology

Dr. Jeffrey Young, School of Computer Science, Georgia Institute of Technology

Dr. Tushar Krishna, School of Electrical and Computer Engineering, Georgia Institute of Technology

Dr. Hyesoon Kim, School of Computer Science, Georgia Institute of Technology

Dr. Nicholas G. McDonald, Nvidia Research

 

Abstract: 

This dissertation introduces new methodologies to co-design deep learning software and supercomputer hardware for large-scale training.

 

The first is an analytical performance model to co-design large language models (LLMs) and supercomputer architectures during the early phases of the system design process. On the algorithm side, we consider diverse implementation strategies, including data, tensor, and pipeline parallelism, communication-computation overlap, and memory optimization. The hardware aspect includes hierarchical memory systems, multiple interconnection networks, and parameterized efficiencies based on operation size. We implement this model in Calculon, an open-source tool that can estimate performance for billions of strategy and architecture combinations. This facilitates co-design-space exploration for future LLMs with trillions of parameters, yielding insights into optimal system characteristics and the interplay between algorithmic and architectural decisions.
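
To give a flavor of what such an analytical model computes, the minimal Python sketch below estimates per-step training time from a handful of hardware and parallelism parameters. It is purely illustrative: every parameter name and formula here is a simplified assumption and does not reproduce Calculon's actual model.

    def estimate_step_time(flops_per_layer, bytes_per_layer, n_layers,
                           peak_flops, mem_bw, net_bw,
                           tp, pp, dp, n_micro, overlap=True):
        """Illustrative per-step time estimate for one pipeline stage.

        All formulas are deliberately simplified assumptions,
        not Calculon's model.
        """
        layers_per_stage = n_layers / pp
        # Compute time: per-layer FLOPs are split across tensor-parallel ranks.
        t_compute = layers_per_stage * flops_per_layer / (tp * peak_flops)
        # Memory time: weights and activations streamed through local memory.
        t_memory = layers_per_stage * bytes_per_layer / mem_bw
        # Communication time: tensor-parallel collectives plus data-parallel
        # gradient synchronization (ring all-reduce volume factor).
        t_comm = layers_per_stage * bytes_per_layer / net_bw
        t_comm += 2 * (dp - 1) / dp * layers_per_stage * bytes_per_layer / net_bw
        # With overlap, communication hides behind compute and memory traffic;
        # without it, the terms serialize.
        per_stage = (max(t_compute, t_memory, t_comm) if overlap
                     else t_compute + t_memory + t_comm)
        # Rough pipeline fill/drain ("bubble") overhead.
        return per_stage * (1 + (pp - 1) / n_micro)

Sweeping a function like this over every combination of (tp, pp, dp) and candidate hardware parameters is the kind of exhaustive exploration described above, though the real tool models far more detail, such as memory capacity limits, overlap granularity, and per-operation efficiencies.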

 

As models exceed 100 trillion parameters, memory capacity and network speed become critical bottlenecks. For the former, Calculon suggests adding slower, high-capacity memory to store all intermediate tensors and model parameters while utilizing faster memory solely for current computation. For the latter, we present novel distributed-memory parallel matrix multiplication algorithms capable of hiding communication entirely, potentially achieving perfect scaling.
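
As an illustration of the communication-hiding idea (not the dissertation's algorithms), the sketch below overlaps a ring shift of B blocks with local computation using non-blocking MPI calls and double buffering. The partitioning scheme and all names are assumptions made for this example.

    from mpi4py import MPI
    import numpy as np

    def ring_matmul(A_rows, B_rows, comm):
        """C_rows = A_rows @ B, with A row-partitioned (m x n) and B
        row-partitioned (n/p x k) across the p ranks of `comm`."""
        p, rank = comm.Get_size(), comm.Get_rank()
        left, right = (rank - 1) % p, (rank + 1) % p
        n_local = B_rows.shape[0]
        C_rows = np.zeros((A_rows.shape[0], B_rows.shape[1]), dtype=A_rows.dtype)

        cur = np.ascontiguousarray(B_rows)   # B block currently held
        nxt = np.empty_like(cur)             # landing buffer for the next block
        for step in range(p):
            reqs = []
            if step < p - 1:
                # Start shifting B around the ring before computing, so the
                # transfer proceeds while the local matmul runs.
                reqs = [comm.Isend(cur, dest=left),
                        comm.Irecv(nxt, source=right)]
            # The block held at this step originated on rank (rank + step) % p,
            # so it pairs with the matching column slice of A_rows.
            owner = (rank + step) % p
            A_slice = A_rows[:, owner * n_local:(owner + 1) * n_local]
            C_rows += A_slice @ cur
            if reqs:
                MPI.Request.Waitall(reqs)
                cur, nxt = nxt, cur
        return C_rows

When the local multiply takes at least as long as shifting one block, the communication disappears from the critical path, which is the sense in which such schemes can approach perfect scaling.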

 

Looking ahead, we foresee a need to model artificial intelligence (AI) applications beyond LLMs and perform detailed system simulations in later design stages. Our second open-source tool, ParaGraph, translates compiled parallel programs into high-level graphs for emulator-based dynamic execution in network simulation environments. Case studies on deep learning workloads extracted from JAX and TensorFlow programs illustrate ParaGraph's utility for software-hardware co-design workflows, including communication optimization and hardware bottleneck identification.
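
For a sense of the kind of operator graph involved, the short sketch below traces a toy JAX function into a jaxpr and prints its dataflow edges. This is only an illustration of graph extraction from a JAX program; it is not how ParaGraph processes compiled parallel programs.

    import jax
    import jax.numpy as jnp

    def mlp(x, w1, w2):
        return jnp.tanh(x @ w1) @ w2

    x, w1, w2 = jnp.ones((8, 16)), jnp.ones((16, 32)), jnp.ones((32, 4))

    # Trace the function into a jaxpr: a flat sequence of primitive operations
    # whose input/output variables define a dataflow graph.
    closed = jax.make_jaxpr(mlp)(x, w1, w2)
    for eqn in closed.jaxpr.eqns:
        print(eqn.primitive.name,
              [str(v) for v in eqn.invars], "->",
              [str(v) for v in eqn.outvars])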

 
