CS 774: Data Exploration, Cleaning, and Integration for Data Science

[See Canvas for the CS 774 homepage of a particular semester]

Course Description

Big Data is often said to deal with four Vs: volume, velocity, variety, and veracity. This course focuses on the variety and veracity challenges, which often arise in data science and AI projects.

In many such projects, data is incorrect, hard to understand, and drawn from a variety of sources. Data scientists often spend 80% of their effort exploring, cleaning, and integrating this data before any analysis can be carried out to extract insights. As a result, managing variety and veracity has received significant attention.

We will study these topics, understand their challenges, and discuss solutions. These solutions often require data management, machine learning, big data scaling, cloud, crowdsourcing, and user interaction techniques. We will discuss ongoing work in both academia and industry.

An unofficial motto of this course is "making messy data usable at scale."

Course Learning Outcomes

Identify and examine the key challenges of managing variety and veracity with large data sets. These include data acquisition, extraction, exploration, cleaning, matching, and merging.
Summarize the variety and veracity solution approaches in academia and industry.
Apply course concepts by designing and carrying out a research project.
Effectively communicate through written reports, oral presentations, and discussions.

Prerequisites

Knowledge of machine learning/AI (CS 540), databases (CS 564), and Python (CS 320) is desired, but not required.

Course Schedule

For each topic, we will examine: (1) the motivation, (2) fundamental concepts and techniques, (3) how the problem is addressed in practice and industry, and (4) current research directions.

Introduction

Part 1. The Raw-Data-to-Insight Pipeline

Data acquisition
Data extraction
Data exploration (browsing/querying, visualizing, profiling)
Data cleaning, enriching
String matching
Data matching and merging
Schema matching and merging
Data transformation
Data analysis

Part 2. Building Data Artifacts

Data warehouses, data lakes, lakehouses
Reference data, master data management (MDM)
Customer data platforms (CDPs), 360s
Knowledge graphs
Data catalogs, data governance

Part 3. Looking Forward

Data mesh, data fabric, LLMs
System issues
Challenges, future directions