CS 744 Big Data Systems - UW Madison, Fall 2020

This class introduces key concepts and the state of the art in big data systems. After covering the basics of the modern hardware and software infrastructure that these systems rely on, we will explore the systems themselves from the ground up.

Specifically, topics we cover will include:

Course Learning Objectives

At the end of the course you will be able to

Logistics

Prerequisites

The prerequisites for this course are Database Systems (CS 564 or CS 764) and Operating Systems (CS 537 or CS 736), or equivalent courses.

Grading

Schedule

Class Date Reading Lecture Material Notes
9/3 How to read a paper Slides Assignment 0
Infrastructure
9/8 The Datacenter as a Computer version 3, Chapters 1 and 2 Slides Slides+Notes
9/10 The Google File System
NFS: Sun's Network File System (Optional)
Slides Slides+Notes Assignment 1 out
9/15 MapReduce: Simplified Data Processing on Large Clusters
MPI Tutorial Introduction and MPI Hello World
Slides Slides+Notes
9/17 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Slides Slides+Notes
Scheduling
9/21 Assignment 1 due
9/22 Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
YARN: Yet Another Resource Negotiator (Optional)
Slides Slides+Notes Assignment 2 out
9/24 DRF: Dominant Resource Fairness Slides Slides+Notes
Machine Learning
9/29 PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Towards a Unified Architecture for in-RDBMS Analytics (Optional)
Slides Slides+Notes
10/1 PipeDream: Generalized Pipeline Parallelism for DNN Training
Submit project bids
10/5 Assignment 2 due
10/6 TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
10/8 Ray: A Distributed Framework for Emerging AI Applications
10/13 Clipper: A Low-Latency Online Prediction Serving System
SQL Frameworks
10/15 SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets
Spark SQL: Relational Data Processing in Spark (Optional)
10/16 Project Introductions Due
10/20 The Snowflake Elastic Data Warehouse
Building An Elastic Query Engine on Disaggregated Storage (Optional)
10/22 Midterm 1 In-class midterm
Stream Processing
10/27 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
10/29 Naiad: A Timely Dataflow System
11/3 Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Graph Processing
11/5 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
11/10 GraphX: Graph Processing in a Distributed Dataflow Framework
Scalability! But at what COST? (Optional)
11/12 PyTorch-BigGraph: A Large-scale Graph Embedding System Project check-ins due
New Data, Hardware Models
11/17 Occupy the Cloud: Distributed Computing for the 99%
11/19 Optimizing Data-Intensive Computations in Existing Libraries with Split Annotations
Weld: A Common Runtime for High Performance Data Analytics (Optional)
11/24 In-Datacenter Performance Analysis of a Tensor Processing Unit Project peer-review
11/26 Happy Thanksgiving!
12/1 Fairness and Machine Learning: Limitations and Opportunities (Introduction)
Measuring discrepancies in Airbnb guest acceptance rates using anonymized demographic data (Pages 1-15)
50 Years of Test (Un)fairness: Lessons for Machine Learning (Optional)
12/3 Midterm 2 In-class midterm
12/8 Course Review
12/10 Poster presentation
12/17 Final project reports