CS 744 Big Data Systems - UW Madison, Spring 2025

This class introduces key concepts and the state of the art in big data systems. After covering the basics of the modern hardware and software infrastructure that these systems leverage, we will explore the systems themselves from the ground up.

Specifically, topics we cover will include:

Course Learning Objectives

At the end of the course you will be able to

Logistics

Pre-requisites

The prerequisites for this course are Database Systems (CS 564 or CS 764) and Operating Systems (CS 537 or CS 736), or equivalent courses.

Grading

Schedule

Class Date Reading Lecture Material Notes
1/21 How to read a paper Slides Slides+Notes Assignment 0
Infrastructure
1/23 The Datacenter as a Computer, version 3, Chapters 1 and 2
Building Meta's GenAI Infrastructure (Optional)
Slides Slides+Notes
1/28 The Google File System
NFS: Sun's Network File System (Optional)
Facebook's Tectonic Filesystem: Efficiency from Exascale (Optional)
Slides Slides+Notes Assignment 1
1/30 MapReduce: Simplified Data Processing on Large Clusters
MPI Tutorial Introduction and MPI Hello World
Dryad | DryadLINQ (Optional)
Slides Slides+Notes
2/4 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Ray: A Distributed Framework for Emerging AI Applications (Optional)
Slides Slides+Notes Assignment 1 due.
Assignment 2 out.
Machine Learning
2/6 PyTorch Distributed: Experiences on Accelerating Data Parallel Training
TensorFlow (Optional)
Towards a Unified Architecture for in-RDBMS Analytics (Optional)
Slides Slides+Notes
2/11 PipeDream: Generalized Pipeline Parallelism for DNN Training
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (Optional)
Slides Slides+Notes
2/13 Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
Gemini (Optional)
Slides Slides+Notes Assignment 2 due. Submit project form.
2/18 NanoFlow: Towards Optimal Large Language Model Serving Throughput
vLLM (Optional)
Slides Slides+Notes
Scheduling
2/20 Twine: A Unified Cluster Management System for Shared Infrastructure
Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade (Optional)
Slides Slides+Notes
2/25 Blox: A Modular Toolkit for Deep Learning Schedulers (Guest lecture) Slides Slides+Notes
2/27 No class
SQL Frameworks
3/4 F1 Query: Declarative Querying at Scale
Spark SQL: Relational Data Processing in Spark (Optional)
Slides Slides+Notes Project Introductions Due.
3/6 The Snowflake Elastic Data Warehouse
Building An Elastic Query Engine on Disaggregated Storage (Optional)
Slides Slides+Notes
3/11 Midterm 1 In-class midterm
Stream Processing
3/13 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing Slides Slides+Notes
3/18 Apache Flink: Stream and Batch Processing in a Single Engine
State management in Apache Flink: Consistent stateful distributed stream processing (Optional)
Slides Slides+Notes
3/20 Discretized Streams: Fault-Tolerant Streaming Computation at Scale Slides Slides+Notes
3/25 Spring Break!
3/27 Spring Break!
Graph Processing, Recommendation Models
4/1 Marius: Learning Massive Graph Embeddings on a Single Machine Slides Slides+Notes Project check-ins due
4/3 AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data Slides Slides+Notes
4/8 BagPipe: Accelerating Deep Recommendation Model Training Slides Slides+Notes
Micro-services, New Hardware, ML for Systems
4/10 Occupy the Cloud: Distributed Computing for the 99% Slides Slides+Notes
4/15 In-Datacenter Performance Analysis of a Tensor Processing Unit Slides Slides+Notes Project check-in feedback
4/17 Sinan: ML-based and QoS-aware resource management for cloud microservices Slides Slides+Notes
4/22 Llumnix: Dynamic Scheduling for Large Language Model Serving Slides Slides+Notes
4/24 Midterm 2
4/29 Fairness and Machine Learning: Limitations and Opportunities (Introduction)
Measuring discrepancies in Airbnb guest acceptance rates using anonymized demographic data (Pages 1-15) (Optional)
50 Years of Test (Un)fairness: Lessons for Machine Learning (Optional)
Slides Slides+Notes
5/1 Poster presentations
5/8 Final project reports due