<node id="676890">
  <nid>676890</nid>
  <type>event</type>
  <uid>
    <user id="27707"><![CDATA[27707]]></user>
  </uid>
  <created>1726591250</created>
  <changed>1726591298</changed>
  <title><![CDATA[PhD Proposal by William Jonghoon Won]]></title>
  <body><![CDATA[<p><strong>Title:</strong> Software-Hardware Optimizations for Efficient Collective Communications in Distributed Machine Learning Platforms</p><p>&nbsp;</p><p><strong>Date</strong>: Monday, September 23, 2024</p><p><strong>Time:</strong> 9:00 AM – 11:00 AM ET</p><p><strong>Location:</strong> Klaus 1212 (hybrid)&nbsp;<a href="https://gatech.zoom.us/j/94843770067?pwd=1kRevvLZLDTxm0N59mBoW70EdL1fbw.1">https://gatech.zoom.us/j/94843770067?pwd=1kRevvLZLDTxm0N59mBoW70EdL1fbw.1</a></p><p>&nbsp;</p><p><strong>William Jonghoon Won</strong></p><p>Ph.D. Student</p><p>School of Computer Science</p><p>College of Computing</p><p>Georgia Institute of Technology</p><p>&nbsp;</p><p><strong>Committee:</strong></p><p>Dr. Tushar Krishna (advisor) - School of Electrical and Computer Engineering &amp; School of Computer Science, Georgia Institute of Technology</p><p>Dr. Yingyan (Celine) Lin - School of Computer Science, Georgia Institute of Technology</p><p>Dr. Divya Mahajan - School of Computer Science &amp; School of Electrical and Computer Engineering, Georgia Institute of Technology</p><p>Dr. Manya Ghobadi - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology</p><p>Dr. Bradford Beckmann - Research and Advanced Development, Advanced Micro Devices</p><p>&nbsp;</p><p><strong>Abstract:</strong></p><p>The advancement of large-scale Machine Learning (ML) models and their massive resource requirements have driven the development of specialized, distributed High-Performance Computing (HPC) platforms tailored to ML workloads. These platforms integrate multiple Neural Processing Units (NPUs) interconnected through custom network fabrics. Since ML models and data are distributed, frequent synchronization of activations and gradients among NPUs is required. This synchronization presents a major bottleneck in distributed ML, making efficient collective communication a pivotal research challenge.</p><p>&nbsp;</p><p>Given the tightly coupled co-design space of distributed ML, judicious software-hardware optimization approaches are essential. To address this, I first present (i) ASTRA-sim2.0, an end-to-end simulation and modeling framework that facilitates design space exploration of the distributed ML stack. Next, I present (ii) LIBRA, an analytical modeling framework that captures the end-to-end execution time of distributed ML on multi-dimensional networks. Through integration with optimizers, LIBRA identifies optimal multi-dimensional network design points. Finally, I introduce (iii) TACOS, an autonomous topology-aware collective algorithm synthesizer that leverages time-expanded network representation and link-chunk matching algorithms to automatically generate optimized collective algorithms for arbitrary target topologies.</p><p>&nbsp;</p>]]></body>
  <field_summary_sentence>
    <item>
      <value><![CDATA[Software-Hardware Optimizations for Efficient Collective Communications in Distributed Machine Learning Platforms]]></value>
    </item>
  </field_summary_sentence>
  <field_summary>
    <item>
      <value><![CDATA[<p>Software-Hardware Optimizations for Efficient Collective Communications in Distributed Machine Learning Platforms</p>]]></value>
    </item>
  </field_summary>
  <field_time>
    <item>
      <value><![CDATA[2024-09-23T09:00:00-04:00]]></value>
      <value2><![CDATA[2024-09-23T11:00:00-04:00]]></value2>
      <rrule><![CDATA[]]></rrule>
      <timezone><![CDATA[America/New_York]]></timezone>
    </item>
  </field_time>
  <field_fee>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_fee>
  <field_extras>
      </field_extras>
  <field_audience>
          <item>
        <value><![CDATA[Public]]></value>
      </item>
      </field_audience>
  <field_media>
      </field_media>
  <field_contact>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_contact>
  <field_location>
    <item>
      <value><![CDATA[Klaus 1212]]></value>
    </item>
  </field_location>
  <field_sidebar>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_sidebar>
  <field_phone>
    <item>
      <value><![CDATA[]]></value>
    </item>
  </field_phone>
  <field_url>
    <item>
      <url><![CDATA[]]></url>
      <title><![CDATA[]]></title>
            <attributes><![CDATA[]]></attributes>
    </item>
  </field_url>
  <field_email>
    <item>
      <email><![CDATA[]]></email>
    </item>
  </field_email>
  <field_boilerplate>
    <item>
      <nid><![CDATA[]]></nid>
    </item>
  </field_boilerplate>
  <links_related>
      </links_related>
  <files>
      </files>
  <og_groups>
          <item>221981</item>
      </og_groups>
  <og_groups_both>
          <item><![CDATA[Graduate Studies]]></item>
      </og_groups_both>
  <field_categories>
          <item>
        <tid>1788</tid>
        <value><![CDATA[Other/Miscellaneous]]></value>
      </item>
      </field_categories>
  <field_keywords>
          <item>
        <tid>100811</tid>
        <value><![CDATA[Phd Defense]]></value>
      </item>
      </field_keywords>
  <field_userdata><![CDATA[]]></field_userdata>
</node>
