1/21 |
How to read a paper |
Slides
Slides+Notes
|
Assignment 0 |
|
Infrastructure |
|
|
1/23 |
The Datacenter as a Computer, version 3, Chapters 1 and 2
Building Meta's GenAI Infrastructure (Optional)
|
Slides
Slides+Notes
|
|
1/28 |
The Google File System
NFS: Sun's Network File System (Optional)
Facebook's Tectonic Filesystem: Efficiency from Exascale (Optional)
|
Slides
Slides+Notes
|
Assignment 1 |
1/30 |
MapReduce: Simplified Data Processing on Large Clusters
MPI Tutorial Introduction and
MPI Hello World
Dryad | DryadLINQ (Optional)
|
Slides
Slides+Notes
|
|
2/4 |
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Ray: A Distributed Framework for Emerging AI Applications (Optional)
|
Slides
Slides+Notes
|
Assignment 1 due. Assignment 2 out. |
|
Machine Learning |
|
|
2/6 |
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
TensorFlow (Optional)
Towards a Unified Architecture for in-RDBMS Analytics (Optional)
|
Slides
Slides+Notes
|
|
2/11 |
PipeDream: Generalized Pipeline Parallelism for DNN Training
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (Optional)
|
Slides
Slides+Notes
|
|
2/13 |
Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
Gemini (Optional)
|
Slides
Slides+Notes
|
Assignment 2 due. Submit project form. |
2/18 |
NanoFlow: Towards Optimal Large Language Model Serving Throughput
vLLM (Optional)
|
Slides
Slides+Notes
|
|
|
Scheduling |
|
|
2/20 |
Twine: A Unified Cluster Management System for Shared Infrastructure
Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade (Optional)
|
Slides
Slides+Notes
|
|
2/25 |
Blox: A Modular Toolkit for Deep Learning Schedulers (Guest lecture)
|
Slides
Slides+Notes
|
|
2/27 |
No class
|
|
|
|
SQL Frameworks |
|
|
3/4 |
F1 Query: Declarative Querying at Scale
Spark SQL: Relational Data Processing in Spark (Optional)
|
Slides
Slides+Notes
|
Project Introductions Due. |
3/6 |
The Snowflake Elastic Data Warehouse
Building An Elastic Query Engine on Disaggregated Storage (Optional)
|
Slides
Slides+Notes
|
|
3/11 |
Midterm 1
|
|
In-class midterm |
|
Stream Processing |
|
|
3/13 |
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
|
Slides
Slides+Notes
|
|
3/18 |
Apache Flink: Stream and Batch Processing in a Single Engine
State management in Apache Flink: Consistent stateful distributed stream processing (Optional)
|
Slides
Slides+Notes
|
|
3/20 |
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
|
Slides
Slides+Notes
|
|
3/25 |
Spring Break!
|
|
|
3/27 |
Spring Break!
|
|
|
|
Graph Processing, Recommendation Models |
|
|
4/1 |
Marius: Learning Massive Graph Embeddings on a Single Machine
|
Slides
Slides+Notes
|
Project check-ins due |
4/3 |
AnalyticDB-V: A Hybrid Analytical Engine Towards Query Fusion for Structured and Unstructured Data
|
Slides
Slides+Notes
|
|
4/8 |
BagPipe: Accelerating Deep Recommendation Model Training
|
Slides
Slides+Notes
|
|
|
Micro-services, New Hardware, ML for Systems |
|
|
4/10 |
Occupy the Cloud: Distributed Computing for the 99%
|
Slides
Slides+Notes
|
|
4/15 |
In-Datacenter Performance Analysis of a Tensor Processing Unit
|
Slides
Slides+Notes
|
Project check-in feedback |
4/17 |
Sinan: ML-based and QoS-aware resource management for cloud microservices
|
Slides
Slides+Notes
|
|
4/22 |
Llumnix: Dynamic Scheduling for Large Language Model Serving
|
Slides
Slides+Notes
|
|
4/24 |
Midterm 2
|
|
|
4/29 |
Fairness and machine learning: Limitations and Opportunities (Introduction)
Measuring discrepancies in Airbnb guest acceptance rates using anonymized demographic data (Pages 1-15) (Optional)
50 Years of Test (Un)fairness: Lessons for Machine Learning (Optional)
|
Slides
Slides+Notes
|
|
5/1 |
Poster presentations
|
|
|
5/8 |
|
|
Final project reports due |