Notes from Remzi's class:
=========================

INTRODUCTION

* What is the motivation for this work?
  + large data sets that need to be processed in a reasonable amount of time
  + hence the need for parallelism (spread the computation across machines)

* What is the environment?
  + commodity PCs
  + commodity disks
  + commodity network

* What does the programming model hide? The messy details of a distributed
  system:
  + failure handling
  + load balancing
  + data distribution
  + parallelization

* What doesn't it hide?
  + users still need to specify the number of map tasks (M) and the number
    of reduce tasks (R)

PROGRAMMING MODEL

* At a high level, describe the programming model
  Map:
  - takes an input pair
  - produces intermediate key/value pairs
  - the MapReduce library groups together all intermediate values associated
    with the same intermediate key I and passes them to the Reduce function
  Reduce:
  - takes an intermediate key and the set of values for that key
  - merges the values belonging to that key together

  Example: WordCount
  - Map: for each word w, emit (w, 1)
  - Reduce: sum all the counts for a particular word
  (a minimal sketch of this example follows)
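The paper presents Map and Reduce as user-supplied functions driven by the
MapReduce library (its examples are C++-style pseudocode). Below is a minimal
single-process Python sketch of WordCount; the `emit` callback and the toy
`run_mapreduce` driver are illustrative stand-ins, not the real library's API.

```python
from collections import defaultdict

def map_fn(key, value, emit):
    # key: document name (unused here); value: document contents
    for word in value.split():
        emit(word, 1)                    # emit (w, 1) for every word

def reduce_fn(key, values):
    # key: a word; values: every count emitted for that word
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # toy stand-in for the library: collect intermediate values by key,
    # then hand each key's value list to the reduce function
    intermediate = defaultdict(list)
    for key, value in inputs:
        map_fn(key, value, lambda k, v: intermediate[k].append(v))
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

print(run_mapreduce([("doc1", "the quick fox the")], map_fn, reduce_fn))
# -> {'the': 2, 'quick': 1, 'fox': 1}
```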
* What types of applications can easily be expressed as MR computations?
  + WordCount
  + Grep
  + Inverted Index
  + Distributed Sort
  ... any application that involves some kind of scan and analysis over a
  data set ...

* What types of applications don't work out with MR?
  + random reads: there is no index (unlike a DB), so lookups are very slow

IMPLEMENTATION

* What are the steps to start up and run an MR job?
  (see the execution walkthrough below)

* What does the value of M (number of mappers) influence?
  + each mapper does some disk I/O to write its intermediate files
  + all-to-all communication: every reducer talks to all mappers
    --> the network is the bottleneck (solution: more bandwidth)
  + sometimes reducers don't have to wait for all mappers to finish before
    starting to fetch intermediate data

* What if M is too big?
  + too many mappers
  + each piece of data is really small
  + too much overhead per piece of data

* What if M is too small?
  + less parallelism
  + hurts load balancing, e.g. 5 nodes and 5 map tasks: one slow task drags
    out the whole job
  + In the paper: "Ideally, M and R should be much larger than the number of
    worker machines. Having each worker perform many different tasks improves
    dynamic load balancing, and also speeds up recovery when a worker fails:
    the many map tasks it has completed can be spread out across all the
    other worker machines."

* What if R is too big?
  + a large number of output files

* Is it necessary to sort the data before giving it to the reducers?
  + not really, the reducers sort it anyway

* How is the final output stored?
  + in GFS

* What types of failures is the system robust to?
  + worker failure:
    ~ detected via periodic alive messages (pings from the master)
    ~ if a worker fails, it is easy: just restart its tasks
      > completed map tasks --> restart (their output is stored locally on
        the failed machine, hence unreachable)
      > in-progress map/reduce tasks --> restart
      > completed reduce tasks --> NO need to restart (output is in GFS)
  + master failure:
    ~ can checkpoint the master's state and restart the master from the
      checkpoint

* What would be the implication of writing map output to GFS instead of
  local disks?
  + no need to restart completed map tasks
  + but some cleanup is needed in GFS when the job is done
  + useful when map tasks take a really long time

* What if bad input repeatedly causes a task to crash?
  + just skip the bad record
  + when does this matter? a bad query can crash every node
  + solution: send the query to a subset of nodes ..., if all of them
    crash, return ..

PERFORMANCE

* What techniques do they use to help performance?
  + backup tasks
  + locality

* How do they ensure that each worker has roughly the same amount of work?
  + configure M and R to be much larger than the number of worker machines

* What is the scale of the system used in the experiments?
  + 1800 machines

* What are some performance problems that they hit?
  + startup overhead: the Grep experiment takes 150s, 60s of which is
    startup time, accounted for by:
    ~ propagating the program to all worker machines
    ~ delays interacting with GFS to open input files and to get the
      information needed for the locality optimization
  + the network is the bottleneck

**MapReduce: Simplified Data Processing on Large Clusters**
===========================================================

Note: may want to take a look at Jeff's notes on this paper

# Motivation
- distributed computing is hard for programmers if they need to care about:
  + parallelization
  + fault tolerance
  + data distribution
  + load balancing
- How to make distributed computing easy?
  ==> MapReduce: a new abstraction for simple computations that hides all of
  the sources of complexity above from programmers

# MapReduce: a new programming model that hides the messy details of parallelism
  Map:
  - takes an input pair
  - produces intermediate key/value pairs
  - the MapReduce library groups together all intermediate values associated
    with the same intermediate key I and passes them to the Reduce function
  Reduce:
  - takes an intermediate key and the set of values for that key
  - merges the values belonging to that key together

  Example: WordCount
  - Map: for each word w, emit (w, 1)
  - Reduce: sum all the counts for a particular word

# How does it work
(a single-process sketch of these steps appears at the end of these notes)
1) the MR library in the user program splits the input files into M pieces
2) there will be M map tasks, R reduce tasks, and 1 master
3) the master picks an idle worker for each task
4) Map worker:
   - parses its input piece
   - produces intermediate key/value pairs, buffered in memory
   - writes the intermediate pairs to *local* disk in R partitions
     (partitioned on the key)
   - once finished, tells the master where the intermediate output lives
5) Reduce worker:
   - told by the master about the output locations
   - fetches that output via RPC
   - sorts it by key (because one partition may contain many different keys)
   - scans the sorted data and applies the reduce function to each key
   - output is appended to a final output file
6) once all tasks are done, the master notifies the user

# Fault tolerance
- Worker failure:
  + detected using pings from the master
  + in-progress tasks (both map and reduce) are marked idle, hence will be
    re-executed by another worker
  + completed map tasks are re-executed because their output is unreachable
    (it is stored on the failed worker's local disk)
  + completed reduce tasks do not need to be re-executed, because their
    output is stored on the global file system (GFS)
  + workers executing reduce tasks are notified of the re-execution of a
    completed map task (why? so they know the new location to read the
    intermediate data from)
- Master failure:
  + the master checkpoints its data structures (info about tasks ...), so a
    new master can be restarted from the last checkpoint

# Semantics in the presence of failure
- if the Map and Reduce operators are deterministic:
  + the output is the same as that of a failure-free execution
  + hence it is easy to reason about
  + relies on atomic commits of map and reduce task outputs ...
- otherwise, weaker semantics

# Some optimizations for performance
- Locality: to save network bandwidth, the master tries to schedule a map
  task on a machine that already holds the input data
- Backup tasks: to deal with "straggler" machines that take an unusually
  long time to complete one of the last few tasks (when does this happen?
  e.g. a bad disk)
  ==> when the job is close to completion, schedule backup executions of
  the remaining in-progress tasks

# Refinements
- why sorted output? useful for efficient random-access lookups by key
- skipping bad records:
  + deals with bugs that cause the Map or Reduce function to crash
    deterministically on particular records
  + fixing the bug is best, but sometimes that is not feasible
    (third-party code)
  (a sketch of this mechanism follows)
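In the paper, each worker installs a signal handler that, on a crash, sends
the sequence number of the offending record to the master in a last-gasp UDP
packet; once the master has seen more than one failure on the same record, it
tells the next re-execution to skip it. A loose single-process sketch of that
policy follows; the names, the in-process failure counter, and the threshold
of 2 are all illustrative, not the paper's actual code.

```python
from collections import Counter

failure_counts = Counter()   # stands in for the master's per-record failure log
SKIP_THRESHOLD = 2           # skip after more than one observed failure

def apply_with_skipping(fn, record_id, record):
    """Apply a user Map/Reduce function, skipping records known to crash it."""
    if failure_counts[record_id] >= SKIP_THRESHOLD:
        return None                      # master marked this record: skip it
    try:
        return fn(record)
    except Exception:
        failure_counts[record_id] += 1   # report the crash (the "last gasp")
        raise                            # the task dies; master re-executes it
```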
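Finally, as promised in the "How does it work" walkthrough above, here is a
single-process sketch of the whole pipeline: M map tasks each bucket their
output into R partitions, and reduce task r gathers partition r from every
map task, sorts, groups, and reduces. The `hash(key) mod R` partitioner is
the paper's stated default; everything else (the `run_job` driver, in-memory
lists standing in for local disk, dicts standing in for GFS output files) is
an illustrative assumption.

```python
from collections import defaultdict
from itertools import groupby

def run_job(splits, map_fn, reduce_fn, R):
    # map phase: one map task per input split (M = len(splits)); each task
    # buckets its output into R partitions on its "local disk"
    partitions = []
    for split in splits:
        parts = defaultdict(list)
        for k, v in map_fn(split):
            parts[hash(k) % R].append((k, v))   # default partitioner: hash(key) mod R
        partitions.append(parts)

    # reduce phase: reduce task r fetches partition r from every map task
    # (the RPC step), sorts by key, groups, and applies the reduce function
    outputs = []
    for r in range(R):
        pairs = sorted((kv for parts in partitions for kv in parts[r]),
                       key=lambda kv: kv[0])    # sort: a partition holds many keys
        outputs.append({k: reduce_fn(k, [v for _, v in grp])
                        for k, grp in groupby(pairs, key=lambda kv: kv[0])})
    return outputs                              # one output "file" per reduce task

# WordCount over two splits with R = 2:
wc_map = lambda text: [(w, 1) for w in text.split()]
wc_reduce = lambda key, values: sum(values)
print(run_job(["a b a", "b c"], wc_map, wc_reduce, R=2))
```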