5th IEEE International Conference on Data Science in Cyberspace (IEEE DSC 2020)
27-29 July 2020
Hong Kong, China
Keynote Speech

Data Cleaning: An Machine Learning Problem in Need of Data Systems Help

 

       

Ihab F. Ilyas
Cheriton School of Computer Science
University of Waterloo

 

Abstract

Data scientists spend big chunk of their time preparing, cleaning, and transforming raw data before getting the chance to feed this data to their well-crafted models. Despite the efforts to build robust predication and classification models, data errors still the main reason for having low quality results. This massive labor-intensive exercises to clean data remain the main impediment to automatic end-to-end AI pipeline for data science.

In this talk, I focus on data cleaning as an inference problem that can be automated by leveraging the great advancements in AI and ML in the last few years. I will start with a background describing the evolution of data cleaning efforts, and I will describe The HoloClean framework, a machine learning framework for data profiling and cleaning (error detection and repair). The framework has multiple successful deployments with cleaning census data, and pilots with commercial enterprises to boost the quality of source (training) data before feeding them to downstream analytics.

HoloClean builds two main probabilistic models: a data generation model (describing how data was intended to look like); and a realization model (describing how errors might be introduced to the intended clean data). The framework uses few-shot learning, data augmentation, and weak supervision to learn the parameters of these models, and use them to predict both error and their possible repairs.

While the idea of using statistical inference to model the joint data distribution of the underlying data is not new, the problem has been always: (1) how to scale a model with millions of data cells (corresponding to random variables); and (2) how to get enough training data to learn the complex models that are capable of accurately predicting the anomalies and the repairs. HoloClean tackles exactly these two problems.

 

Biography

Ihab Ilyas is a professor in the Cheriton School of Computer Science and the NSERC-Thomson Reuters Research Chair on data quality at the University of Waterloo. His main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, machine learning for data curation, and information extraction. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration, and he is also the co-founder of inductiv (now part of Apple), a Waterloo-based startup on using AI for structured data cleaning. He is a recipient of the Ontario Early Researcher Award, a Cheriton Faculty Fellowship, an NSERC Discovery Accelerator Award, and a Google Faculty Award, and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees, elected SIGMOD vice chair. He holds a PhD in computer science from Purdue University, West Lafayette.

Sponsors

IEEE

PolyU