CS 744 Big Data Systems - UW Madison, Fall 2021

This class will introduce key concepts and the state of the art in big data systems. After covering the basics of the modern hardware and software infrastructure that these systems build on, we will explore the systems themselves from the ground up.

Specifically, topics we cover will include:

Cluster architecture
Big Data stacks: Hadoop, Spark
Scheduling and Resource Management
Machine learning
Batch and stream analytics
Graph processing
Serverless platforms

Course Learning Objectives

At the end of the course you will be able to:

Explain the design and architecture of systems used for big data processing
Compare, contrast and evaluate research papers in the field of big data systems
Develop and deploy applications on a cluster of machines using existing big data frameworks
Design, articulate and report new research and development ideas in topics related to big data systems

Logistics

Course Number: CS 744, Fall 2021, UW Madison
Instructor: Shivaram Venkataraman
Time: Tuesday and Thursday, 9:30AM - 10:45AM
Location: Engineering Hall 2317
Teaching Assistant: Yien Xu
Office hours:
- Shivaram Venkataraman - Thu 11am-12pm, CS 7367
- Yien Xu - Mon 5pm-6pm, Online
Discussion: We will be using Piazza for outside-class Q&A and to discuss papers. Piazza is designed to get you help quickly and efficiently from classmates and from me. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza.
Text: There is no required text for this course. The lectures will be based on discussing research papers.

Prerequisites

The prerequisites for this course are Database Systems (CS 564 or CS 764) and Operating Systems (CS 537 or CS 736), or equivalent courses.

Grading

Class Participation: 10%
Paper reviews: 10%
Assignments: 20% (2 @ 10% each)
Midterms: 30% (2 @ 15% each)
Final Project (in groups): 30%

Schedule

Class Date	Reading	Lecture Material	Notes
9/9	How to read a paper	Slides Slides+Notes	Assignment 0
	Infrastructure
9/14	The Datacenter as a Computer version 3, Chapters 1 and 2	Slides Slides+Notes
9/16	The Google File System; NFS: Sun's Network File System (Optional)	Slides Slides+Notes	Assignment 1
9/21	MapReduce: Simplified Data Processing on Large Clusters; MPI Tutorial Introduction and MPI Hello World	Slides Slides+Notes
9/23	Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing	Slides Slides+Notes
	Scheduling
9/28	Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center; YARN: Yet Another Resource Negotiator (Optional)	Slides Slides+Notes	Assignment 1 due
9/29			Assignment 2
9/30	DRF: Dominant Resource Fairness	Slides Slides+Notes
	Machine Learning
10/5	PyTorch Distributed: Experiences on Accelerating Data Parallel Training; Towards a Unified Architecture for in-RDBMS Analytics (Optional)	Slides Slides+Notes
10/7	PipeDream: Generalized Pipeline Parallelism for DNN Training	Slides Slides+Notes	Submit project bids
10/12	Ray: A Distributed Framework for Emerging AI Applications	Slides Slides+Notes	Assignment 2 due
10/14	Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning	Slides Slides+Notes
10/19	Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis	Slides Slides+Notes
	SQL Frameworks
10/21	SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets; Spark SQL: Relational Data Processing in Spark (Optional)	Slides Slides+Notes
10/25			Project Introductions Due
10/26	The Snowflake Elastic Data Warehouse; Building An Elastic Query Engine on Disaggregated Storage (Optional)	Slides Slides+Notes
10/28	Midterm 1		In-class midterm
	Stream Processing
11/2	The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing	Slides Slides+Notes
11/4	Naiad: A Timely Dataflow System	Slides Slides+Notes
11/9	Discretized Streams: Fault-Tolerant Streaming Computation at Scale	Slides Slides+Notes
	Graph Processing
11/11	PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs	Slides Slides+Notes
11/16	GraphX: Graph Processing in a Distributed Dataflow Framework; Scalability! But at what COST?	Slides Slides+Notes
11/18	Marius: Learning Massive Graph Embeddings on a Single Machine	Slides Slides+Notes
	New Data, Hardware Models
11/23	Occupy the Cloud: Distributed Computing for the 99%	Slides Slides+Notes
11/25	Happy Thanksgiving!
11/30	SplitFS: Reducing Software Overhead in File Systems for Persistent Memory	Slides Slides+Notes	Project check-ins due
12/2	In-Datacenter Performance Analysis of a Tensor Processing Unit	Slides Slides+Notes	Project check-in feedback
12/7	Midterm 2		In-class midterm
12/9	Fairness and machine learning: Limitations and Opportunities (Introduction); Measuring discrepancies in Airbnb guest acceptance rates using anonymized demographic data (Pages 1-15) (Optional); 50 Years of Test (Un)fairness: Lessons for Machine Learning (Optional)	Slides Slides+Notes
12/14	Poster presentations
12/20			Final project reports due