CS 744 Big Data Systems - UW Madison, Fall 2019

This class introduces key concepts and the state of the art in big data systems. After covering the basics of the modern hardware and software infrastructure that these systems build on, we will explore the systems themselves from the ground up.

Specifically, topics we cover will include:

Course Learning Objectives

At the end of the course you will be able to

Logistics

Pre-requisites

The prerequisites for this course are Database Systems (CS 564 or CS 764) and Operating Systems (CS 537 or CS 736), or equivalent courses.

Grading

Schedule

Class Date | Reading | Lecture Material | Notes
9/5 | How to read a paper | Slides | Fill out background survey; Assignment 0

Infrastructure
9/10 | The Datacenter as a Computer, version 3, Chapters 1 and 2 | Slides |
9/12 | The Google File System; NFS: Sun's Network File System (optional) | Slides | Assignment 1 out
9/17 | MapReduce: Simplified Data Processing on Large Clusters | Slides |
9/19 | Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing | |

Scheduling
9/24 | Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center; YARN: Yet Another Resource Negotiator (optional) | | Assignment 1 due
9/26 | DRF: Dominant Resource Fairness | | Assignment 2 out. Submit project topics, groups

Machine Learning
10/1 | Towards a Unified Architecture for in-RDBMS Analytics | |
10/3 | Scaling Distributed Machine Learning with the Parameter Server | |
10/8 | TensorFlow: A system for large-scale machine learning | |
10/10 | Ray: A Distributed Framework for Emerging AI Applications | | Assignment 2 due
10/15 | Clipper: A Low-Latency Online Prediction Serving System | |
10/17 | Gandiva: Introspective Cluster Scheduling for Deep Learning | | Project introduction due

SQL Frameworks
10/22 | Spark SQL: Relational Data Processing in Spark | |
10/24 | Global analytics in the face of bandwidth and regulatory constraints | |
10/29 | Midterm 1 | | In-class midterm

Stream Processing
10/31 | The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing | |
11/5 | Naiad: A Timely Dataflow System | |
11/7 | Discretized Streams: Fault-Tolerant Streaming Computation at Scale | |

Graph Processing
11/12 | PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs | |
11/14 | GraphX: Graph Processing in a Distributed Dataflow Framework; Scalability! But at what COST? (optional) | |
11/19 | Guest lecture | | Project check-in meetings

New Hardware Models
11/21 | Weld: A Common Runtime for High Performance Data Analytics | |
11/26 | Occupy the Cloud: Distributed Computing for the 99% | |
11/28 | Happy Thanksgiving! | |
12/3 | In-Datacenter Performance Analysis of a Tensor Processing Unit | |
12/5 | Review | |
12/10 | Midterm 2 | |
12/13 | Poster presentation in CS building | |
12/17 | Final project reports due | |