
PhD Defense by Peng Li

Title: Cleaning and Learning over Dirty Tabular Data

Date: Friday, Dec 1, 2023

Time: 15:00 – 17:00 EST

Location: Teams Link

Peng Li

Ph.D. Candidate in Computer Science

School of Computer Science

College of Computing

Georgia Institute of Technology

Committee: 

Dr. Xu Chu (advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Kexin Rong (co-advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology

Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology

Dr. Yeye He – Data Management, Exploration and Mining Group, Microsoft Research

Abstract:

The quality of machine learning (ML) applications is only as good as the quality of the data they are trained on. Unfortunately, real-world data is rarely free of errors, and tabular data in particular frequently suffers from issues such as missing values, outliers, and inconsistencies. Data cleaning is therefore widely regarded as an essential step in an ML workflow and an effective way to improve ML performance. However, it is also a time-consuming task, reportedly taking up to 80% of data scientists' time. Traditional approaches treat data cleaning as a standalone task, independent of its downstream applications, which may not effectively improve ML performance and can sometimes even worsen it. This separation also incurs unnecessary cost for cleaning errors that have little impact on ML performance.

This dissertation jointly considers data cleaning and machine learning, focusing on algorithms and systems for cleaning and learning over dirty tabular data with dual objectives: (1) optimizing downstream ML performance and (2) minimizing human effort. We start with CleanML, an empirical study that systematically evaluates the impact of data cleaning on downstream ML performance. We then present CPClean, a cost-effective human-in-the-loop data cleaning algorithm for ML that minimizes human cleaning effort while preserving ML performance. Next, we demonstrate DiffPrep, an automatic data preprocessing method that efficiently selects high-quality data preprocessing (cleaning) pipelines to maximize downstream ML performance without human involvement. Finally, to obviate the need for humans to manually program table-restructuring transformations, we present Auto-Tables, which automatically transforms tables from non-standard formats into a standard format without any human effort. Together, the works in this dissertation form an end-to-end system for cleaning and learning over dirty tabular data.
