MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Google
One-line Summary
MapReduce is an easy-to-use parallel data processing model that hides the messy details of parallelization, scheduling, execution, and failure handling from users.
Overview/Main Points
- The programming model (a minimal word-count sketch follows this list)
- input type of the model: a set of (key, value) pairs
- Map (key/value pair ➝ list of intermediate key/value pairs)
- input type: a (key, value) pair
- output type: a list of (intermediate key, value) pairs
- Reduce (group by intermediate key, then aggregate)
- input type: an (intermediate key, list of values) pair
- output type: an (output key, value) pair
- output type of the model: a set of output (key, value) pairs
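To make the types concrete, here is a minimal word-count sketch in Python (the paper's actual interface is C++); `map_fn`, `reduce_fn`, and the sequential `run_mapreduce` driver are illustrative names, not the library's API:

```python
from itertools import groupby

def map_fn(key, value):
    # key: document name (unused here); value: document contents.
    # Emit an intermediate (word, 1) pair for every word.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: all counts emitted for that word.
    # Emit a single (word, total) output pair.
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Sequential stand-in for the framework: the "shuffle" is just
    # sort-then-group-by-intermediate-key.
    intermediate = sorted(kv for k, v in inputs for kv in map_fn(k, v))
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        yield from reduce_fn(key, (v for _, v in group))

print(list(run_mapreduce([("doc1", "the quick the")], map_fn, reduce_fn)))
# -> [('quick', 1), ('the', 2)]
```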
- Characteristics
- easy to program
- no side effects, thanks to the functional programming model
- easy recovery via re-execution
- The run-time system
- Infrastructure
- commodity machines: failures are common
- Storage: local disks plus a distributed file system (GFS)
- Data partitioning for Map tasks (16 MB–64 MB per split)
- Task scheduling (a bookkeeping sketch follows this sub-list)
- master assigns Map and Reduce tasks to worker machines
- master maintains each task's status (idle / in-progress / completed)
- Reduce workers pull intermediate results from the Map workers' local disks; Reduce output is committed atomically to GFS by renaming a temporary file
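A rough Python sketch of what the master's bookkeeping could look like; the `Master` class and its methods are paraphrased from the paper's description, not its actual code:

```python
from enum import Enum

class State(Enum):
    IDLE = 0
    IN_PROGRESS = 1
    COMPLETED = 2

class Master:
    """Hypothetical sketch of the master's per-task bookkeeping."""

    def __init__(self, num_map_tasks):
        self.state = {t: State.IDLE for t in range(num_map_tasks)}
        self.worker_of = {}   # task id -> worker currently running it
        self.locations = {}   # map task id -> its R intermediate files

    def assign(self, worker):
        # Hand some idle task to a worker asking for work.
        for task, st in self.state.items():
            if st is State.IDLE:
                self.state[task] = State.IN_PROGRESS
                self.worker_of[task] = worker
                return task
        return None  # nothing idle right now

    def map_done(self, task, file_locations):
        # A map worker reports the local-disk locations of its R
        # intermediate partitions; the master forwards them to reduce
        # workers, which pull the data over RPC.
        self.state[task] = State.COMPLETED
        self.locations[task] = file_locations
```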
- Failure handling on a worker (these rules are sketched after this sub-list)
- master pings each worker periodically to detect liveness
- a completed Map task on a failed worker is rescheduled, because its results live on that worker's local disk and become inaccessible
- a completed Reduce task needs nothing, because its output is already stored in GFS
- for master failure: one idea is to recover from periodic checkpoints; the actual implementation simply aborts and lets users retry the MapReduce computation
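The re-execution rules above can be sketched as follows (hypothetical Python; the `tasks`/`last_pong` structures and the 60-second timeout are illustrative assumptions):

```python
import time

def handle_worker_failures(tasks, last_pong, timeout=60.0):
    # `tasks`: task id -> {"kind": "map" | "reduce",
    #                      "state": "idle" | "in_progress" | "completed",
    #                      "worker": worker id or None}
    # `last_pong`: worker id -> time of the last answered ping.
    dead = {w for w, t in last_pong.items()
            if time.monotonic() - t > timeout}
    for task in tasks.values():
        if task["worker"] not in dead:
            continue
        redo = (task["state"] == "in_progress"       # lost work, either kind
                or (task["state"] == "completed"
                    and task["kind"] == "map"))      # output on dead local disk
        # A completed reduce task needs nothing: its output sits in GFS.
        if redo:
            task["state"], task["worker"] = "idle", None
```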
- Communication
- Performance concerns (a backup-execution sketch follows this sub-list)
- scheduling policy for good data locality (prefer running a Map task on or near a machine holding a replica of its input)
- data partitioning policy for good load balance
- avoiding the master itself becoming a bottleneck
- bottleneck due to extremely slow nodes (stragglers): mitigated by backup execution
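A hedged sketch of backup execution; the 95% trigger threshold and the data structures are assumptions (the paper only says backups are scheduled when the operation is close to completion):

```python
def schedule_backups(tasks, idle_workers, near_done=0.95):
    # `tasks`: list of dicts with "id" and "state" keys; `idle_workers`:
    # workers with nothing to do. Once almost all tasks are completed,
    # launch duplicate ("backup") executions of the remaining in-progress
    # stragglers; whichever copy finishes first marks the task done.
    completed = sum(1 for t in tasks if t["state"] == "completed")
    if completed < near_done * len(tasks):
        return []  # too early; backups would only waste resources
    stragglers = (t["id"] for t in tasks if t["state"] == "in_progress")
    return list(zip(stragglers, idle_workers))
```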
- Refinements (a partitioner/combiner sketch follows this list)
- partitioning function for intermediate results, by hashing on the key
- ordering guarantee: within a partition, intermediate keys are processed in ascending key order
- combiner functions (partial merging on the Map worker before data crosses the network)
- debugging through sequential local execution
- skipping bad records that deterministically crash user code
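A small sketch of the partitioner and combiner mechanics (illustrative Python; `partition` mirrors the paper's default hash(key) mod R, and the hostname example comes from the paper):

```python
def partition(key, R):
    # Default partitioning in the paper: hash(key) mod R. Users may
    # supply their own, e.g. hash(Hostname(url)) mod R so that all URLs
    # from one host land in the same reduce partition. (Python's hash()
    # is a stand-in; a real system would use a stable hash function.)
    return hash(key) % R

def combine(key, values):
    # Combiner sketch: typically the same code as the reducer (here,
    # word-count summing), but run on each Map worker to pre-aggregate
    # local output and shrink the data shipped across the network.
    yield (key, sum(values))
```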
Relevance
Flaws