CS 774: Data Exploration, Cleaning, and Integration for Data Science
[See Canvas for the CS 774 homepage of a particular semester]
Course Description
Big Data is often said to deal with four Vs: volume, velocity, variety, and veracity. This course focuses on the variety and veracity challenges, which often arise in data science and AI projects.
In many such projects, data is incorrect, hard to understand, and drawn from a variety of sources. Data scientists often spend as much as 80% of their effort exploring, cleaning, and integrating this data before any analysis can be carried out to extract insights. As a result, managing variety and veracity has received significant attention.
We will study these topics, understand their challenges, and discuss solutions. These solutions often require data management, machine learning, big data scaling, cloud, crowdsourcing, and user interaction techniques. We will discuss ongoing work in both academia and industry.
An unofficial motto of this course is "making messy data usable at scale."
Course Learning Outcomes
- Identify and examine the key challenges of managing variety and veracity with large data sets. These include data acquisition, extraction, exploration, cleaning, matching, and merging.
- Summarize the variety and veracity solution approaches in academia and industry.
- Apply course concepts through experiential learning by designing and carrying out a research project.
- Effectively communicate through written reports, oral presentations, and discussions.
Prerequisites
Knowledge of machine learning/AI (CS 540), databases (CS 564), and Python (CS 320) is desired, but not required.
Course Schedule
For each topic, we will examine: (1) the motivation, (2) fundamental concepts and techniques, (3) how the problem is addressed in practice and industry, and (4) current research directions.
Introduction
Part 1. The Raw-Data-to-Insight Pipeline
- Data acquisition
- Data extraction
- Data exploration (browsing/querying, visualizing, profiling)
- Data cleaning, enriching
- String matching
- Data matching and merging
- Schema matching and merging
- Data transformation
- Data analysis
Part 2. Building Data Artifacts
- Data warehouses, data lakes, lakehouses
- Reference data, master data management (MDM)
- Customer data platforms (CDPs), 360s
- Knowledge graphs
- Data catalogs, data governance
Part 3. Looking Forward
- Data mesh, data fabric, LLMs
- System issues
- Challenges, future directions