CS 744 Big Data Systems - UW Madison, Fall 2018

This class will introduce key concepts and the state of the art in big data systems. After covering the basics of the modern hardware and software infrastructure that these systems leverage, we will explore the systems themselves from the ground up.

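Specifically, the topics we cover will include storage systems, computation frameworks, machine learning, SQL frameworks, stream processing, graph processing, monitoring and debugging, and new hardware models.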

Course prerequisites: Database Systems (CS 564 or CS 764) and Operating Systems (CS 537 or CS 736), or equivalent courses.


For more details on class presentations and paper reviews, please see the lecture format page.


Class Date Reading Lecture Material Notes
9/6 Introduction
Shivaram: slides
Fill out presentation preference form.
9/7 Assignment 0
9/11 The Datacenter as a Computer, Chapters 1 and 2
VL2: A Scalable and Flexible Data Center Network (Optional)
Presentation Tips
Storage Systems
9/13 The Google File System
Flat Datacenter Storage (Optional)
f4: Facebook’s Warm BLOB Storage System (Optional)
Arjun Balasubramanian
Aarati Kakaraparthy
9/17 Assignment 1 out
9/18 Bigtable: A Distributed Storage System for Structured Data
Dynamo: Amazon’s Highly Available Key-value Store (Optional)
Spanner: Google's Globally-Distributed Database (Optional)
Saurabh Agarwal
Adarsh Kumar
Computation Frameworks
9/20 MapReduce: Simplified Data Processing on Large Clusters
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (Optional)
CIEL: a universal execution engine for distributed data-flow computing (Optional)
Yahn-Chung Chen
Roshan G Lal
9/25 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (Optional)
Encapsulation of parallelism in the Volcano query processing system (Optional)
Shivaram Venkataraman
Derek Paulsen
Huawei Wang
9/27 Borg: Large-scale cluster management at Google with Borg. See also Borg, Omega, and Kubernetes
YARN: Yet Another Resource Negotiator (Optional)
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (Optional)
Shivaram Venkataraman
Manjunath Shettar
Jayashankar Tekkedatha
10/1 Assignment 1 due
10/2 DRF: Dominant Resource Fairness
Tetris: Multi-Resource Packing for Cluster Schedulers (Optional)
Quincy: Fair Scheduling for Distributed Computing Clusters (Optional)
Shivaram Venkataraman
Sanchit Jain
Steve Wang
Machine Learning
10/4 Towards a Unified Architecture for in-RDBMS Analytics
DimmWitted: A Study of Main-Memory Statistical Analytics (Optional)
KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics (Optional)
Shivaram (Bismarck)
Shivaram (DimmWitted)
Yudhister Satija
Assignment 2 out. Submit project topics and groups.
10/9 Guest lecture on Scalable ML Algorithms
10/11 TensorFlow: A System for Large-Scale Machine Learning
Ray: A Distributed Framework for Emerging AI Applications (Optional)
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems (Optional). Also see the document on programming style.
Sonu Agarwal
Mingren Shen
10/16 Scaling Distributed Machine Learning with the Parameter Server
STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning (Optional)
PipeDream: Fast and Efficient Pipeline Parallel DNN Training (Optional)
Srujith Poondla
Varun Batra
Assignment 2 due
10/18 Clipper: A Low-Latency Online Prediction Serving System
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster (Optional)
Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems (Optional)
Siddhant Garg
Qinyuan Sun
SQL Frameworks
10/23 Spark SQL: Relational Data Processing in Spark
Impala: A Modern, Open-Source SQL Engine for Hadoop (Optional)
Dremel: Interactive Analysis of Web-Scale Datasets (Optional)
Yogesh Chockalingam
Philip Martinkus
Project introduction due.
10/25 Global analytics in the face of bandwidth and regulatory constraints
TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks (Optional)
CLARINET: WAN-Aware Optimization for Analytics Queries (Optional)
Abbinaya Kalyanaraman
Robert Claus
10/30 Trill: A High-Performance Incremental Query Processor for Diverse Analytics
Rethinking SIMD Vectorization for In-Memory Databases (Optional)
Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited (Optional)
Sri Harshal Parimi
Yanghui Kang
Stream Processing
11/1 Naiad: A Timely Dataflow System
Twitter Heron: Stream Processing at Scale (Optional)
Apache Flink™: Stream and Batch Processing in a Single Engine (Optional)
Zijun Ma
Akshaya Kalyanaraman
11/5 Midterm on 11/5 from 7.15pm to 9.15pm. Venue 1221 CS
11/6 Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Drizzle: Fast and Adaptable Stream Processing at Scale (Optional)
Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems (Optional)
Kaushik Chandrasekhar
Samhith Venkatesh
11/8 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Realtime Data Processing at Facebook (Optional)
Aurora: a new model and architecture for data stream management (Optional)
Abhay Venkatesh
Rahul Jayan
Graph Processing
11/13 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
GraphX: Graph Processing in a Distributed Dataflow Framework (Optional)
Scalability! But at what COST? (Optional)
Bidyut Hota
Abhinav Garg
11/15 Arabesque: A System for Distributed Graph Mining
Fast and Concurrent RDF Queries with RDMA-based Distributed Graph Exploration (Optional)
ASAP: Fast, Approximate Pattern Mining at Scale (Optional)
Shuoxuan Dong
Yunang Chen
Monitoring, Debugging
11/20 Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
Making Sense of Performance in Data Analytics Frameworks (Optional)
COZ: Finding Code that Counts with Causal Profiling (Optional)
Zi Wang
Anuja Golechha
11/22 Happy Thanksgiving!
New Hardware Models
11/27 Occupy the Cloud: Distributed Computing for the 99%
Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads (Optional)
Serverless Computation with OpenLambda (Optional)
Wen-Fu Lee
Chirayu Garg
11/29 FaRM: Fast Remote Memory
No compromises: distributed transactions with consistency, availability, and performance (Optional)
FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs (Optional)
Xiuyuan He
12/4 In-Datacenter Performance Analysis of a Tensor Processing Unit
A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (Optional)
Strata: A Cross Media File System (Optional)
Venkatesh Somyajulu
Derek Hancock
12/6 "One Size Fits All": An Idea Whose Time Has Come and Gone
Shivaram
12/13 Poster session 3.30pm-5pm
12/17 Final project reports due