
PhD Defense by Peng Li

Title: Cleaning and Learning over Dirty Tabular Data

Date: Friday, Dec 1, 2023

Time: 15:00 – 17:00 EST

Location: Teams Link

Peng Li

Ph.D. Candidate in Computer Science

School of Computer Science

College of Computing

Georgia Institute of Technology

Committee: 

Dr. Xu Chu (advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Kexin Rong (co-advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology

Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology

Dr. Yeye He – Data Management, Exploration and Mining Group, Microsoft Research

Abstract:

The quality of machine learning (ML) applications is only as good as the quality of the data they are trained on. Unfortunately, real-world data is rarely free of errors, and tabular data in particular frequently suffers from issues such as missing values, outliers, and inconsistencies. Data cleaning is therefore widely regarded as an essential step in an ML workflow and an effective way to improve ML performance. However, it is also a time-consuming task, reportedly taking up to 80% of data scientists' time. Traditional approaches treat data cleaning as a standalone task, independent of its downstream applications, which may not effectively improve ML performance and can sometimes even worsen it. This separation also incurs unnecessary cost for cleaning errors that have little impact on ML performance.

This dissertation jointly considers data cleaning and machine learning, focusing on algorithms and systems for cleaning and learning over dirty tabular data with dual objectives: (1) optimizing downstream ML performance and (2) minimizing human effort. We start with CleanML, an empirical study that systematically evaluates the impact of data cleaning on downstream ML performance. We then present CPClean, a cost-effective human-in-the-loop data cleaning algorithm for ML that minimizes human cleaning effort while preserving ML performance. Next, we demonstrate DiffPrep, an automatic data preprocessing method that efficiently selects high-quality data preprocessing (cleaning) pipelines to maximize downstream ML performance without human involvement. Finally, to obviate the need for humans to manually program table-restructuring transformations, we present Auto-Tables, which automatically transforms tables from non-standard formats into a standard format without any human effort. Together, the works in this dissertation form an end-to-end system for cleaning and learning over dirty tabular data.
