CS 744 Big Data Systems - UW Madison, Spring 2025

This class will introduce key concepts and the state of the art in big data systems. After covering the basics of the modern hardware and software infrastructure that these systems leverage, we will explore the systems themselves from the ground up.

Specifically, topics we cover will include:

Cluster architecture
Big Data stacks: Hadoop, Spark
Scheduling and Resource Management
Machine learning
Batch and stream analytics
Graph processing
Serverless platforms

Course Learning Objectives

At the end of the course you will be able to:

Explain the design and architecture of systems used for big data processing
Compare, contrast and evaluate research papers in the field of big data systems
Develop and deploy applications on a cluster of machines using existing big data frameworks
Design, articulate and report new research and development ideas in topics related to big data systems

Logistics

Course Number: CS 744, Spring 2025, UW Madison
Instructor: Shivaram Venkataraman
Time: Tuesday and Thursday, 1:00PM - 2:15PM
Location: Engineering Hall 2535
Teaching Assistant: Tareq Mahmood
Office hours:
- Shivaram Venkataraman: Tue 3pm-4pm at CS 7367
- Tareq Mahmood: Mon 4pm-5pm and Thu 3pm-4pm at CS 3205
Discussion: We will be using Piazza for outside-class Q&A and to discuss papers. The system is designed to get you help quickly and efficiently from classmates and from me. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza.
Text: There is no required text for this course. The lectures will be based on discussing research papers.

Prerequisites

Course prerequisites: Database Systems (CS 564 or CS 764) and Operating Systems (CS 537 or CS 736), or equivalent courses.

Grading

Class Participation: 10%
Paper reviews: 10%
Assignments: 20% (2 @ 10% each)
Two Midterms: 15% each
Final Project (in groups): 30%

Schedule

Class Date	Reading	Lecture Material	Notes
1/21	How to read a paper	Slides Slides+Notes	Assignment 0
	Infrastructure
1/23	The Datacenter as a Computer version 3, Chapters 1 and 2 Building Meta's GenAI Infrastructure (Optional)	Slides Slides+Notes
1/28	The Google File System NFS: Sun's Network File System (Optional) Facebook's Tectonic Filesystem: Efficiency from Exascale (Optional)	Slides Slides+Notes	Assignment 1
1/30	MapReduce: Simplified Data Processing on Large Clusters MPI Tutorial Introduction and MPI Hello World Dryad | DryadLINQ (Optional)	Slides Slides+Notes
2/4	Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Ray: A Distributed Framework for Emerging AI Applications (Optional)	Slides Slides+Notes	Assignment 1 due. Assignment 2 out.
	Machine Learning
2/6	PyTorch Distributed: Experiences on Accelerating Data Parallel Training TensorFlow (Optional) Towards a Unified Architecture for in-RDBMS Analytics (Optional)	Slides Slides+Notes
2/11	PipeDream: Generalized Pipeline Parallelism for DNN Training Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (Optional)	Slides Slides+Notes
2/13	Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures Gemini (Optional)	Slides Slides+Notes	Assignment 2 due. Submit project form.
2/18	NanoFlow: Towards Optimal Large Language Model Serving Throughput vLLM (Optional)	Slides Slides+Notes
	Scheduling
2/20	Twine: A Unified Cluster Management System for Shared Infrastructure Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade (Optional)	Slides Slides+Notes
2/25	Blox: A Modular Toolkit for Deep Learning Schedulers (Guest lecture: Saurabh Agarwal) Gavel (Optional)	Slides Slides+Notes
2/27	No class
	SQL Frameworks
3/4	F1 Query: Declarative Querying at Scale Spark SQL: Relational Data Processing in Spark (Optional)	Slides Slides+Notes	Project Introductions Due.
3/6	The Snowflake Elastic Data Warehouse Building An Elastic Query Engine on Disaggregated Storage (Optional)	Slides Slides+Notes
3/11	Midterm 1		In-class midterm
	Stream Processing
3/13	The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing	Slides Slides+Notes
3/18	Apache Flink: Stream and Batch Processing in a Single Engine State management in Apache Flink: Consistent stateful distributed stream processing (Optional)	Slides Slides+Notes
3/20	Discretized Streams: Fault-Tolerant Streaming Computation at Scale	Slides Slides+Notes
3/25	Spring Break!
3/27	Spring Break!
	Graph Processing, Recommendation Models
4/1	Marius: Learning Massive Graph Embeddings on a Single Machine (Guest lecture: Jason Mohoney)	Slides Slides+Notes	~~Project check-ins due~~
4/3	AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data (Guest lecture: Jason Mohoney)	Slides Slides+Notes
4/8	BagPipe: Accelerating Deep Recommendation Model Training	Slides Slides+Notes	Project check-ins due
	Micro-services, New Hardware, ML for Systems
4/10	Occupy the Cloud: Distributed Computing for the 99%	Slides Slides+Notes
4/15	In-Datacenter Performance Analysis of a Tensor Processing Unit	Slides Slides+Notes	Project check-in feedback
4/17	Sinan: ML-based and QoS-aware resource management for cloud microservices	Slides Slides+Notes
4/22	Llumnix: Dynamic Scheduling for Large Language Model Serving	Slides Slides+Notes
4/24	Midterm 2
4/29	Fairness and machine learning: Limitations and Opportunities (Introduction) Measuring discrepancies in Airbnb guest acceptance rates using anonymized demographic data (Pages 1-15) (Optional) 50 Years of Test (Un)fairness: Lessons for Machine Learning (Optional)	Slides Slides+Notes
5/1	Poster presentations
5/8			Final project reports due