CS 744 Big Data Systems - UW Madison, Fall 2019

This class will introduce key concepts and the state of the art in big data systems. After covering the basics of the modern hardware and software infrastructure that these systems leverage, we will explore the systems themselves from the ground up.

Specifically, topics we cover will include:

Course Learning Objectives

At the end of the course you will be able to

Logistics

Pre-requisites

The prerequisites for this course are Database Systems (CS 564 or CS 764) and Operating Systems (CS 537 or CS 736), or equivalent courses.

Grading

Schedule

Class Date Reading Lecture Material Notes
9/5 How to read a paper Slides Fill out background survey
Assignment 0
Infrastructure
9/10 The Datacenter as a Computer, version 3, Chapters 1 and 2 Slides
9/12 The Google File System
NFS: Sun's Network File System (Optional)
Slides Assignment 1 out
9/17 MapReduce: Simplified Data Processing on Large Clusters Slides
9/19 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Slides
Scheduling
9/24 Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
YARN: Yet Another Resource Negotiator (Optional)
Slides
9/25 Assignment 1 due
9/26 DRF: Dominant Resource Fairness Slides
9/27 Assignment 2 out
Machine Learning
10/1 Towards a Unified Architecture for in-RDBMS Analytics Slides
10/3 Scaling Distributed Machine Learning with the Parameter Server Slides
10/7 Submit project topics, groups
10/8 TensorFlow: A System for Large-Scale Machine Learning Slides
10/10 Ray: A Distributed Framework for Emerging AI Applications Slides
10/11 Assignment 2 due
10/15 Clipper: A Low-Latency Online Prediction Serving System Slides
10/17 Gandiva: Introspective Cluster Scheduling for Deep Learning Slides Project introduction due.
SQL Frameworks
10/22 Spark SQL: Relational Data Processing in Spark Slides
10/24 Global Analytics in the Face of Bandwidth and Regulatory Constraints Slides
10/29 Midterm 1 In-class midterm
Stream Processing
10/31 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing Slides
Slides+Notes
11/5 Guest lecture
11/7 Naiad: A Timely Dataflow System Slides
Slides+Notes
11/12 Discretized Streams: Fault-Tolerant Streaming Computation at Scale Slides
Slides+Notes
Graph Processing
11/14 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs Slides
Slides+Notes
11/19 GraphX: Graph Processing in a Distributed Dataflow Framework
Scalability! But at what COST? (Optional)
Project check-in
New Hardware Models
11/21 Weld: A Common Runtime for High Performance Data Analytics
11/26 Occupy the Cloud: Distributed Computing for the 99%
11/28 Happy Thanksgiving!
12/3 In-Datacenter Performance Analysis of a Tensor Processing Unit
12/5 Review
12/10 Midterm 2
12/13 Poster presentation in CS building
12/17 Final project reports due