PhD Defense by Srinivas Eswar

Title: Scalable Data Mining via Constrained Low Rank Approximation

Date: Friday, July 1st, 2022

Time: 2pm - 4pm ET

Physical Location: Coda C1215 Midtown

Virtual Location: https://gatech.zoom.us/j/92347767822

 

Srinivas Eswar

School of Computational Science and Engineering

Georgia Institute of Technology

 

Committee:

Dr. Richard Vuduc (Advisor, School of Computational Science and Engineering, Georgia Institute of Technology)

Dr. Haesun Park (Co-Advisor, School of Computational Science and Engineering, Georgia Institute of Technology)

Dr. Ümit V. Çatalyürek (School of Computational Science and Engineering, Georgia Institute of Technology)

Dr. Edmond Chow (School of Computational Science and Engineering, Georgia Institute of Technology)

Dr. Grey Ballard (Department of Computer Science, Wake Forest University)

 

------------------------ 

 

Abstract:

Matrix and tensor approximation methods are recognised as foundational tools for modern data analytics. Their strength lies in a long history of rigorous and principled theoretical foundations, judicious formulations via various constraints, and the availability of fast software implementations. Multiple constrained low rank approximation (CLRA) formulations exist for commonly encountered tasks such as clustering, dimensionality reduction, and anomaly detection. The primary challenge in modern data analytics is the sheer volume of data to be analysed, which often requires multiple machines just to hold the dataset in memory. This dissertation presents CLRA as a key enabler of scalable data mining on distributed-memory parallel machines.

 

Nonnegative Matrix Factorisation (NMF) is the primary CLRA method studied in this dissertation. NMF imposes nonnegativity constraints on the factor matrices and is popular for its interpretability and clustering prowess. The major bottleneck in most NMF algorithms is a distributed matrix-multiplication kernel. We develop the PLANC software package, which includes efficient matrix-multiplication and matricised-tensor-times-Khatri-Rao-product (MTTKRP) kernels tailored to the CLRA case. PLANC employs carefully designed parallel algorithms and data distributions to avoid unnecessary computation and communication. With these key kernels in place, we extend PLANC to a variety of settings, including symmetry constraints, second-order methods, and multiple data modalities. We demonstrate the effectiveness of PLANC via scaling studies on supercomputers at the Oak Ridge Leadership Computing Facility.
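For readers unfamiliar with NMF, the classical multiplicative-update algorithm of Lee and Seung illustrates why matrix multiplication dominates the cost: every iteration is built from products such as W.T @ A and W @ H @ H.T. The single-machine sketch below is purely illustrative — it is not PLANC's distributed algorithm, and the function name and parameters are assumptions for this example:

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix A as W @ H with W, H >= 0,
    using Lee-Seung multiplicative updates (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # Each update is dominated by dense matrix products; at scale these
        # become the distributed matrix-multiplication kernels the abstract
        # identifies as the bottleneck.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Small demo: factor a random nonnegative 50x40 matrix at rank 5.
A = np.abs(np.random.default_rng(1).standard_normal((50, 40)))
W, H = nmf(A, k=5)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

The multiplicative form keeps the factors nonnegative automatically, since each update multiplies a nonnegative matrix by a ratio of nonnegative quantities.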

 
