PhD Defense by Peng Li
Title: Cleaning and Learning over Dirty Tabular Data
Date: Friday, Dec 1, 2023
Time: 15:00 – 17:00 EST
Location: Teams Link
Peng Li
Ph.D. Candidate in Computer Science
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Dr. Xu Chu (advisor) – School of Computer Science, Georgia Institute of Technology
Dr. Kexin Rong (co-advisor) – School of Computer Science, Georgia Institute of Technology
Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology
Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology
Dr. Yeye He – Data Management, Exploration and Mining Group, Microsoft Research
Abstract:
The quality of machine learning (ML) applications is only as good as the quality of the data they are trained on. Unfortunately, real-world data is rarely error-free; tabular data in particular frequently suffers from issues such as missing values, outliers, and inconsistencies. Data cleaning is therefore widely regarded as an essential step in an ML workflow and an effective way to improve ML performance. However, it is also time-consuming, reportedly taking up to 80% of data scientists' time. Traditional approaches treat data cleaning as a standalone task, independent of its downstream applications; as a result, they may fail to improve ML performance, can sometimes even worsen it, and often incur unnecessary costs cleaning errors that have only a minor impact on ML performance.
This dissertation considers data cleaning and machine learning jointly, focusing on algorithms and systems for cleaning and learning over dirty tabular data with two objectives: (1) optimizing downstream ML performance and (2) minimizing human effort. We start with CleanML, an empirical study that systematically evaluates the impact of data cleaning on downstream ML performance. We then present CPClean, a cost-effective human-in-the-loop data cleaning algorithm for ML that minimizes human cleaning effort while preserving ML performance. Next, we demonstrate DiffPrep, an automatic data preprocessing method that efficiently selects high-quality data preprocessing (cleaning) pipelines to maximize downstream ML performance without human involvement. Finally, to obviate the need for manually programmed table-restructuring transformations, we present Auto-Tables, which automatically transforms tables from non-standard formats into a standard format without any human effort. Together, these works form an end-to-end system for cleaning and learning over dirty tabular data.
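To make the table-restructuring problem concrete, the following is a minimal hypothetical sketch (not the Auto-Tables system itself) of the kind of transformation it automates: a "cross-tab" table with years spread across columns is reshaped into the standard one-row-per-observation relational form that most ML tools expect. The data values and column names here are invented for illustration; the sketch uses pandas' `melt` to perform by hand what Auto-Tables would synthesize automatically.

```python
import pandas as pd

# Hypothetical non-standard input: one column per year (a common
# spreadsheet layout that ML pipelines cannot consume directly).
wide = pd.DataFrame({
    "city": ["Atlanta", "Boston"],
    "2021": [10, 20],
    "2022": [15, 25],
})

# Restructure into standard relational form: one row per (city, year) pair.
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")
print(tidy)
```

Auto-Tables' contribution is inferring such a transformation program from the input table alone, rather than requiring a user to write the `melt` call (or its equivalent) manually.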