CS 774: Data Exploration, Cleaning, and Integration for Data Science

[See Canvas for the CS 774 homepage of a particular semester]

Course Description

Big Data is often said to deal with four Vs: volume, velocity, variety, and veracity. This course focuses on the variety and veracity challenges, which often arise in data science and AI projects.

In many such projects, the data is incorrect, hard to understand, and drawn from a variety of sources. Data scientists often spend 80% of their effort exploring, cleaning, and integrating this data before any analysis can be carried out to extract insights. As a result, managing variety and veracity has received significant attention.

We will study these topics, understand their challenges, and discuss solutions. These solutions often require data management, machine learning, big data scaling, cloud, crowdsourcing, and user interaction techniques. We will discuss ongoing work in both academia and industry.

An unofficial motto of this course is "making messy data usable at scale."


Course Learning Outcomes


Prerequisites

Knowledge of machine learning/AI (CS 540), databases (CS 564), and Python (CS 320) is helpful but not required.


Course Schedule

For each topic, we will examine: (1) the motivation, (2) fundamental concepts and techniques, (3) how the problem is addressed in practice and industry, and (4) current research directions.

Introduction

Part 1. The Raw-Data-to-Insight Pipeline

Part 2. Building Data Artifacts

Part 3. Looking Forward