PhD Defense by Ankush Mandal

Title: Enabling Parallelism and Optimizations in Data Mining Algorithms for Power-law Data

————————————

Ankush Mandal

Ph.D. Candidate

School of Computer Science

College of Computing

Georgia Institute of Technology

Date: Tuesday, June 30th, 2020

Time: 1:00 PM – 3:00 PM ET

Location: https://bluejeans.com/vsarkar9 (remote)

Committee:

————————————

Dr. Vivek Sarkar (Advisor, School of Computer Science, Georgia Tech)

Dr. Anshumali Shrivastava (Co-advisor, Department of Computer Science, Rice University)

Dr. Hyesoon Kim (School of Computer Science, Georgia Tech)

Dr. Santosh Pande (School of Computer Science, Georgia Tech)

Dr. Richard Vuduc (School of Computational Science and Engineering, Georgia Tech)

Abstract:

————————————

In this era of big data, extracting meaningful information from a large amount of data in a reasonable time became possible mainly through two types of advancements --- a) algorithmic advances, such as fast approximate algorithms and efficient learning algorithms, and b) architectural advances, such as machines with massive compute power involving distributed multi-core processors and high throughput accelerators. The amount of parallelism available in current and future processors is ever-increasing, and thus parallel algorithms are critical for fully utilizing today’s computing resources. Furthermore, in the case of data mining problems, exploiting data properties for performance gain becomes crucial. In this work, we focus our attention on power-law behavior --- a common property found in a large class of data, such as text, internet traffic, and click-stream data. Specifically, we address the following questions in the context of power-law data: How well do the critical data mining algorithms of current interest fit with today's parallel architectures? Which algorithmic and mapping opportunities can be leveraged to further improve performance?, and What are the relative challenges and gains for such approaches?

Specifically, we first investigate the suitability of a classical application in large scale data stream processing --- the frequency estimation problem --- for GPU-scale parallelism. Sketching algorithms are a popular choice for this task due to their desirable trade-off between estimation accuracy and space-time efficiency. However, most of the past work on sketch-based frequency estimation focused on CPU implementations, with little exploration of throughput-optimized parallel architectures such as GPUs. In our work, we propose a novel approach for sketches, which exploits the natural skew in the power-law data to efficiently utilize the massive amounts of parallelism in modern GPUs.

Next, we explore one of the most common and important tasks in data mining --- identification of most frequent elements --- on modern distributed settings with both multi-core and multi-node CPU parallelism. For identifying top-K frequent items, Count-Min Sketch (CMS) has an excellent update time but lacks the important property of reducibility, which is needed for exploiting data parallelism. On the other end, the popular Frequent Algorithm (FA) leads to reducible summaries, but its update costs are high. Our approach Topkapi, gives the best of both worlds, i.e., it is reducible like FA and has an efficient update time similar to CMS. For power-law data, Topkapi possesses strong theoretical guarantees and leads to significant performance gains, relative to past work.

Finally, we study Word2Vec, a popular word embedding method widely used in Machine learning and Natural Language Processing applications, such as machine translation, sentiment analysis, and query answering. This time, we target Single Instruction Multiple Data (SIMD) parallelism. With the increasing vector lengths in commodity CPUs, such as AVX-512 with a vector length of 512 bits, efficient vector processing unit utilization becomes a major performance game-changer. By employing a static multi-version code generation strategy coupled with an algorithmic approximation based on the power-law frequency distribution of words, we achieve significant reductions in training time relative to the state-of-the-art.

Media

No media selected

Summary

Details

Tuesday

Jun 30 2020

02:00pm - 04:00pm

URL: https://bluejeans.com/vsarkar9

In campus calendar: No

Sidebar Content

No sidebar content

Groups

Graduate Studies

Status

Workflow Status:Published
Created By:Tatianna Richardson
Created:06/16/2020
Modified By:Tatianna Richardson
Modified:06/16/2020

Mercury (Hg)