CS 744 Big Data Systems - UW Madison, Fall 2018

This class will introduce key concepts and the state of the art in big data systems. After covering the basics of the modern hardware and software infrastructure that these systems leverage, we will explore the systems themselves from the ground up.

Specifically, topics we cover will include:

Logistics

Pre-requisites

The prerequisites for this course are Database Systems (CS 564 or CS 764) and Operating Systems (CS 537 or CS 736), or equivalent courses.

Grading

For more details on class presentations and paper reviews, please see the lecture format page.

Schedule

Class Date Reading Lecture Material Notes
9/6 Introduction Shivaram: slides Fill out presentation preference form.
9/7 Assignment 0
9/11 The Datacenter as a Computer, Chapter 1 and 2
VL2: A Scalable and Flexible Data Center Network (Optional)
Presentation Tips
Slides
Storage Systems
9/13 The Google File System
Flat Datacenter Storage (Optional)
f4: Facebook’s Warm BLOB Storage System (Optional)
Shivaram
Arjun Balasubramanian
Aarati Kakaraparthy
9/17 Assignment 1 out
9/18 Bigtable: A Distributed Storage System for Structured Data
Dynamo: Amazon’s Highly Available Key-value Store (Optional)
Spanner: Google's Globally-Distributed Database (Optional)
Shivaram
Saurabh Agarwal
Adarsh Kumar
Computation Frameworks
9/20 MapReduce: Simplified Data Processing on Large Clusters
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (Optional)
CIEL: a universal execution engine for distributed data-flow computing (Optional)
Shivaram
Yahn-Chung Chen
Roshan G Lal
9/25 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (Optional)
Encapsulation of parallelism in the Volcano query processing system (Optional)
Derek Paulsen, Huawei Wang
Scheduling
9/27 Borg: Large-scale cluster management at Google with Borg. See also Borg, Omega, and Kubernetes
YARN: Yet Another Resource Negotiator (Optional)
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (Optional)
Manjunath Shettar, Jayashankar Tekkedatha. Submit project topics, group.
9/28 Assignment 1 due. Assignment 2 out
10/2 DRF: Dominant Resource Fairness
Tetris: Multi-Resource Packing for Cluster Schedulers (Optional)
Quincy: Fair Scheduling for Distributed Computing Clusters (Optional)
Steve Wang, Sanchit Jain
Machine Learning
10/4 Towards a Unified Architecture for in-RDBMS Analytics
DimmWitted: A Study of Main-Memory Statistical Analytics (Optional)
KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics (Optional)
Derek Hancock, Yudhister Satija
10/9 Guest lecture on Scalable ML Algorithms. Assignment 2 due.
10/11 TensorFlow: A system for large-scale machine learning
Ray: A Distributed Framework for Emerging AI Applications (Optional)
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems Also see document on programming style (Optional)
Sonu Agarwal, Mingren Shen
10/16 Scaling Distributed Machine Learning with the Parameter Server
STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning (Optional)
PipeDream: Fast and Efficient Pipeline Parallel DNN Training (Optional)
Srujith Poondla, Varun Batra
10/18 Clipper: A Low-Latency Online Prediction Serving System
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster (Optional)
Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems (Optional)
Siddhant Garg, Qinyuan Sun
SQL Frameworks
10/23 Spark SQL: Relational Data Processing in Spark
Impala: A Modern, Open-Source SQL Engine for Hadoop (Optional)
Dremel: Interactive Analysis of Web-Scale Datasets (Optional)
Yogesh Chockalingam, Philip Martinkus. Project introduction due.
10/25 Global analytics in the face of bandwidth and regulatory constraints
TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks (Optional)
CLARINET: WAN-Aware Optimization for Analytics Queries (Optional)
Abbinaya Kalyanaraman, Robert Claus
10/30 Trill: A High-Performance Incremental Query Processor for Diverse Analytics
Rethinking SIMD Vectorization for In-Memory Databases (Optional)
Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited (Optional)
Sri Harshal Parimi, Yanghui Kang
Stream Processing
11/1 Naiad: A Timely Dataflow System
Twitter Heron: Stream Processing at Scale (Optional)
Apache Flink™: Stream and Batch Processing in a Single Engine (Optional)
Zijun Ma, Akshaya Kalyanaraman
11/5 Midterm on 11/5 from 7:15pm to 9:15pm. Venue TBD
11/6 Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Drizzle: Fast and Adaptable Stream Processing at Scale (Optional)
Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems (Optional)
Kaushik Chandrasekhar, Samhith Venkatesh
11/8 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Realtime Data Processing at Facebook (Optional)
Aurora: a new model and architecture for data stream management (Optional)
Abhay Venkatesh, Rahul Jayan
Graph Processing
11/13 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
GraphX: Graph Processing in a Distributed Dataflow Framework (Optional)
Scalability! But at what COST? (Optional)
Bidyut Hota, Abhinav Garg
11/15 Arabesque: A System for Distributed Graph Mining
Fast and Concurrent RDF Queries with RDMA-based Distributed Graph Exploration (Optional)
ASAP: Fast, Approximate Pattern Mining at Scale (Optional)
Shuoxuan Dong, Yunang Chen
Monitoring, Debugging
11/20 Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
Making Sense of Performance in Data Analytics Frameworks (Optional)
COZ: Finding Code that Counts with Causal Profiling (Optional)
Zi Wang, Anuja Golechha
11/22 Happy Thanksgiving!
New Hardware Models
11/27 Occupy the Cloud: Distributed Computing for the 99%
Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads (Optional)
Serverless Computation with OpenLambda (Optional)
Wen-Fu Lee, Chirayu Garg
11/29 FaRM: Fast Remote Memory
No compromises: distributed transactions with consistency, availability, and performance (Optional)
FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs (Optional)
Xiuyuan He, Xiaotian Li
12/4 In-Datacenter Performance Analysis of a Tensor Processing Unit
A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (Optional)
Strata: A Cross Media File System (Optional)
Venkatesh Somyajulu
12/6 "One Size Fits All": An Idea Whose Time Has Come and Gone
12/11 Big Data Systems: Looking into the future Shivaram
12/17 Final project reports due