[Architecture: Compute+Overall (B+09)] The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, L.A. Barroso, U. Hölzle, Synthesis Lectures on Computer Architecture, 2009. Chapters 1 and 2.
[Architecture: Networks: Optional (G+09)] VL2: A Scalable and Flexible Data Center Network, Greenberg et al, SIGCOMM, 2009.
[Architecture: Storage (S+10)] The Hadoop Distributed File System, Shvachko et al, MSST, 2010.
[Architecture: Storage: Optional (GG+10)] The Google File System, Ghemawat et al, SOSP, 2003.
[Streaming: Heron Optional (KB+15)] Twitter Heron: Stream Processing at Scale, Kulkarni et al, SIGMOD, 2015.
[Streaming: SparkStreaming (ZD+13)] Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Zaharia et al, SOSP, 2013. Also read this introduction to Structured Streaming.
[Streaming: Flink: Optional (C+15)] Apache Flink: Stream and Batch Processing in a Single Engine, Carbone et al, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015.
[QMS: Kafka: Optional (K+11)] Kafka: A Distributed Messaging System for Log Processing, Kreps et al, NetDB Workshop, 2011. Also read this comparison of widely used Queuing Messaging Processing Systems.
[Streaming: rStreams (L+16)] StreamScope: Continuous Reliable Distributed Processing of Big Data Streams, Lin et al, NSDI, 2016.
Applications: Graph Processing
[GraphProc: Pregel (M+10)] Pregel: A System for Large-Scale Graph Processing, Malewicz et al, SIGMOD, 2010.
[GraphProc: PowerGraph: Optional (GL+12)] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, Gonzalez et al, OSDI, 2012.
[GraphProc: GraphX (G+14)] GraphX: Graph Processing in a Distributed Dataflow Framework, Gonzalez et al, OSDI, 2014.
[Streaming/Execution: Naiad: Optional (M+13)] Naiad: A Timely Dataflow System, Murray et al, SOSP, 2013.
Applications: Machine Learning
[ML: ParamServ (LA+14)] Scaling Distributed Machine Learning with the Parameter Server, Li et al, OSDI, 2014.
[ML: STRADS (K+16)] STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning, Kim et al, EuroSys, 2016.
[ML: TensorFlow (AB+16)] TensorFlow: A System for Large-Scale Machine Learning, Abadi et al, OSDI, 2016.
[DeepLearning: MS: Optional (C+14)] Project Adam: Building an Efficient and Scalable Deep Learning Training System, Chilimbi et al, OSDI, 2014.
[ML: KeystoneML (S+17)] KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics, Sparks et al, ICDE, 2017.
[ML: Clipper (C+17)] Clipper: A Low-Latency Online Prediction Serving System, Crankshaw et al, NSDI, 2017.
[ML: SLAQ (Z+17)] SLAQ: Quality-Driven Scheduling for Distributed Machine Learning, Zhang et al, SoCC, 2017.
[GraphProc/ML: GraphLab: Optional (L+12)] Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, Low et al, VLDB, 2012.
Potpourri: Runtime, Serverless and Approximation
[Runtime: Weld: Optional (P+17)] Weld: A Common Runtime for High Performance Data Analytics, Palkar et al, CIDR, 2017.
[DeepLearning: DeepXplore: Optional (PC+17)] DeepXplore: Automated Whitebox Testing of Deep Learning Systems, Pei et al, SOSP, 2017.
[Serverless: PyWren (J+17)] Occupy the Cloud: Distributed Computing for the 99%, Jonas et al, SoCC, 2017.
[Approx: BlinkDB (AM+13)] BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data, Agarwal et al, EuroSys, 2013.