[Architecture: Compute+Overall (B+09)] The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, L.A. Barroso, U. Hölzle, Synthesis Lectures on Computer Architecture, 2009. Chapters 1 and 2.
[Architecture: Networks: Optional (G+09)] VL2: A Scalable and Flexible Data Center Network, Greenberg et al., SIGCOMM 2009.
[Architecture: Storage (S+10)] The Hadoop Distributed File System, Shvachko et al, MSST, 2010.
[Architecture: Storage: Optional (GG+10)] The Google File System, Ghemawat et al, SOSP, 2003.
[Streaming: Heron Optional (KB+15)] Twitter Heron: Stream Processing at Scale, Kulkarni et al, SIGMOD, 2015.
[Streaming: SparkStreaming (ZD+13)] Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Zaharia et al, SOSP, 2013. Also read this introduction to Structured Streaming.
[Streaming: Flink: Optional (C+15)] Apache Flink: Stream and Batch Processing in a Single Engine, Carbone et al, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015.
[QMS: Kafka: Optional (K+11)] Kafka: A Distributed Messaging System for Log Processing, Kreps et al, NetDB Workshop, 2011. Also read this comparison of widely used Queuing Messaging Processing Systems.
[Streaming: rStreams (L+16)] StreamScope: Continuous Reliable Distributed Processing of Big Data Streams, Lin et al, NSDI, 2016.
Applications: Graph Processing
[GraphProc: Pregel (M+10)] Pregel: A System for Large-Scale Graph Processing, Malewicz et al, SIGMOD, 2010.
[GraphProc: PowerGraph: Optional (GL+12)] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, Gonzalez et al, OSDI, 2012.
[GraphProc: GraphX (G+14)] GraphX: Graph Processing in a Distributed Dataflow Framework, Gonzalez et al, OSDI, 2014.
[Streaming/Execution: Naiad: Optional (M+13)] Naiad: A Timely Dataflow System, Murray et al, SOSP, 2013.
Applications: Machine Learning
[ML: ParamServ (LA+14)] Scaling Distributed Machine Learning with the Parameter Server, Li et al, OSDI, 2014.
[ML: STRADS Optional (K+16)] STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning, Kim et al, EuroSys, 2016.
[ML: SLAQ (Z+17)] SLAQ: Quality-Driven Scheduling for Distributed Machine Learning, Zhang et al, SoCC, 2017.
[ML: TensorFlow (AB+16)] TensorFlow: A System for Large-Scale Machine Learning, Abadi et al, OSDI, 2016.
[ML: Gandiva Optional (X+18)] Gandiva: Introspective Cluster Scheduling for Deep Learning, Xiao et al, OSDI, 2018.
[ML: Clipper (C+17)] Clipper: A Low-Latency Online Prediction Serving System, Crankshaw et al, NSDI, 2017.
[RL: Ray Optional (M+18)] Ray: A Distributed Framework for Emerging AI Applications, Moritz et al, OSDI, 2018.
Potpourri: Hardware, Serverless and Approximation
[Serverless: PyWren (J+17)] Occupy the Cloud: Distributed Computing for the 99%, Jonas et al, SoCC, 2017.
[Approx: BlinkDB (AM+13)] BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data, Agarwal et al, EuroSys, 2013.
[Hardware: TPU: Optional (J+17)] In-Datacenter Performance Analysis of a Tensor Processing Unit, Jouppi et al, ISCA, 2017.