Notes from Remzi's class:
=========================

INTRODUCTION

* What is the motivation for this work?
  + large data sets that need to be processed in a reasonable amount of time
  + hence the need for parallelism (spread the computation across machines)

* What is the environment?
  + commodity PCs
  + commodity disks
  + commodity network

* What does the programming model hide? The messy details of a distributed
  system:
  + failure handling
  + load balancing
  + data distribution
  + parallelization

* What doesn't it hide?
  + users still need to specify the number of map tasks (M) and the number
    of reduce tasks (R)

PROGRAMMING MODEL

* At a high level, describe the programming model
  Map:
  - takes an input pair
  - produces intermediate key/value pairs
  - the MapReduce library groups together all intermediate values associated
    with the same intermediate key I and passes them to the Reduce function
  Reduce:
  - takes an intermediate key and the set of values for that key
  - merges the values belonging to that key together

  Example: WordCount
  - Map: for each word w, emit (w, 1)
  - Reduce: sum all the counts for a particular word
  (a minimal sketch of this example follows)
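The paper presents Map and Reduce as user-supplied functions driven by the
MapReduce library (its examples are C++-style pseudocode). Below is a minimal
single-process Python sketch of WordCount; the `emit` callback and the toy
`run_mapreduce` driver are illustrative stand-ins, not the real library's API.

```python
from collections import defaultdict

def map_fn(key, value, emit):
    # key: document name (unused here); value: document contents
    for word in value.split():
        emit(word, 1)                    # emit (w, 1) for every word

def reduce_fn(key, values):
    # key: a word; values: every count emitted for that word
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # toy stand-in for the library: collect intermediate values by key,
    # then hand each key's value list to the reduce function
    intermediate = defaultdict(list)
    for key, value in inputs:
        map_fn(key, value, lambda k, v: intermediate[k].append(v))
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

print(run_mapreduce([("doc1", "the quick fox the")], map_fn, reduce_fn))
# -> {'the': 2, 'quick': 1, 'fox': 1}
```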
* What types of applications can easily be expressed as MR computations?
  + WordCount
  + Grep
  + Inverted Index
  + Distributed Sort
  ... any application that involves some kind of scan and analysis over a
  data set ...

* What types of applications don't work out with MR?
  + random reads: there is no index (unlike a DB), so lookups are very slow

IMPLEMENTATION

* What are the steps to start up and run an MR job?
  (see the execution walkthrough below)

* What does the value of M (number of mappers) influence?
  + each mapper does some disk I/O to write its intermediate files
  + all-to-all communication: every reducer talks to all mappers
    --> the network is the bottleneck (solution: more bandwidth)
  + sometimes reducers don't have to wait for all mappers to finish before
    starting to fetch intermediate data

* What if M is too big?
  + too many mappers
  + each piece of data is really small
  + too much overhead per piece of data

* What if M is too small?
  + less parallelism
  + hurts load balancing, e.g. 5 nodes and 5 map tasks: one slow task drags
    out the whole job
  + In the paper: "Ideally, M and R should be much larger than the number of
    worker machines. Having each worker perform many different tasks improves
    dynamic load balancing, and also speeds up recovery when a worker fails:
    the many map tasks it has completed can be spread out across all the
    other worker machines."

* What if R is too big?
  + a large number of output files

* Is it necessary to sort the data before giving it to the reducers?
  + not really, the reducers sort it anyway

* How is the final output stored?
  + in GFS

* What types of failures is the system robust to?
  + worker failure:
    ~ detected via periodic alive messages (pings from the master)
    ~ if a worker fails, it is easy: just restart its tasks
      > completed map tasks --> restart (their output is stored locally on
        the failed machine, hence unreachable)
      > in-progress map/reduce tasks --> restart
      > completed reduce tasks --> NO need to restart (output is in GFS)
  + master failure:
    ~ can checkpoint the master's state and restart the master from the
      checkpoint

* What would be the implication of writing map output to GFS instead of
  local disks?
  + no need to restart completed map tasks
  + but some cleanup is needed in GFS when the job is done
  + useful when map tasks take a really long time

* What if bad input repeatedly causes a task to crash?
  + just skip the bad record
  + when does this matter? a bad query can crash every node
  + solution: send the query to a subset of nodes ..., if all of them
    crash, return ..

PERFORMANCE

* What techniques do they use to help performance?
  + backup tasks
  + locality

* How do they ensure that each worker has roughly the same amount of work?
  + configure M and R to be much larger than the number of worker machines

* What is the scale of the system used in the experiments?
  + 1800 machines

* What are some performance problems that they hit?
  + startup overhead: the Grep experiment takes 150s, 60s of which is
    startup time, accounted for by:
    ~ propagating the program to all worker machines
    ~ delays interacting with GFS to open input files and to get the
      information needed for the locality optimization
  + the network is the bottleneck

**MapReduce: Simplified Data Processing on Large Clusters**
===========================================================

Note: may want to take a look at Jeff's notes on this paper

# Motivation
- distributed computing is hard for programmers if they need to care about:
  + parallelization
  + fault tolerance
  + data distribution
  + load balancing
- How to make distributed computing easy?
  ==> MapReduce: a new abstraction for simple computations that hides all of
  the sources of complexity above from programmers

# MapReduce: a new programming model that hides the messy details of parallelism
  Map:
  - takes an input pair
  - produces intermediate key/value pairs
  - the MapReduce library groups together all intermediate values associated
    with the same intermediate key I and passes them to the Reduce function
  Reduce:
  - takes an intermediate key and the set of values for that key
  - merges the values belonging to that key together

  Example: WordCount
  - Map: for each word w, emit (w, 1)
  - Reduce: sum all the counts for a particular word

# How does it work
(a single-process sketch of these steps appears at the end of these notes)
1) the MR library in the user program splits the input files into M pieces
2) there will be M map tasks, R reduce tasks, and 1 master
3) the master picks an idle worker for each task
4) Map worker:
   - parses its input piece
   - produces intermediate key/value pairs, buffered in memory
   - writes the intermediate pairs to *local* disk in R partitions
     (partitioned on the key)
   - once finished, tells the master where the intermediate output lives
5) Reduce worker:
   - told by the master about the output locations
   - fetches that output via RPC
   - sorts it by key (because one partition may contain many different keys)
   - scans the sorted data and applies the reduce function to each key
   - output is appended to a final output file
6) once all tasks are done, the master notifies the user

# Fault tolerance
- Worker failure:
  + detected using pings from the master
  + in-progress tasks (both map and reduce) are marked idle, hence will be
    re-executed by another worker
  + completed map tasks are re-executed because their output is unreachable
    (it is stored on the failed worker's local disk)
  + completed reduce tasks do not need to be re-executed, because their
    output is stored on the global file system (GFS)
  + workers executing reduce tasks are notified of the re-execution of a
    completed map task (why? so they know the new location to read the
    intermediate data from)
- Master failure:
  + the master checkpoints its data structures (info about tasks ...), so a
    new master can be restarted from the last checkpoint

# Semantics in the presence of failure
- if the Map and Reduce operators are deterministic:
  + the output is the same as that of a failure-free execution
  + hence it is easy to reason about
  + relies on atomic commits of map and reduce task outputs ...
- otherwise, weaker semantics

# Some optimizations for performance
- Locality: to save network bandwidth, the master tries to schedule a map
  task on a machine that already holds the input data
- Backup tasks: to deal with "straggler" machines that take an unusually
  long time to complete one of the last few tasks (when does this happen?
  e.g. a bad disk)
  ==> when the job is close to completion, schedule backup executions of
  the remaining in-progress tasks

# Refinements
- why sorted output? useful for efficient random-access lookups by key
- skipping bad records:
  + deals with bugs that cause the Map or Reduce function to crash
    deterministically on particular records
  + fixing the bug is best, but sometimes that is not feasible
    (third-party code)
  (a sketch of this mechanism follows)
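In the paper, each worker installs a signal handler that, on a crash, sends
the sequence number of the offending record to the master in a last-gasp UDP
packet; once the master has seen more than one failure on the same record, it
tells the next re-execution to skip it. A loose single-process sketch of that
policy follows; the names, the in-process failure counter, and the threshold
of 2 are all illustrative, not the paper's actual code.

```python
from collections import Counter

failure_counts = Counter()   # stands in for the master's per-record failure log
SKIP_THRESHOLD = 2           # skip after more than one observed failure

def apply_with_skipping(fn, record_id, record):
    """Apply a user Map/Reduce function, skipping records known to crash it."""
    if failure_counts[record_id] >= SKIP_THRESHOLD:
        return None                      # master marked this record: skip it
    try:
        return fn(record)
    except Exception:
        failure_counts[record_id] += 1   # report the crash (the "last gasp")
        raise                            # the task dies; master re-executes it
```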
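Finally, as promised in the "How does it work" walkthrough above, here is a
single-process sketch of the whole pipeline: M map tasks each bucket their
output into R partitions, and reduce task r gathers partition r from every
map task, sorts, groups, and reduces. The `hash(key) mod R` partitioner is
the paper's stated default; everything else (the `run_job` driver, in-memory
lists standing in for local disk, dicts standing in for GFS output files) is
an illustrative assumption.

```python
from collections import defaultdict
from itertools import groupby

def run_job(splits, map_fn, reduce_fn, R):
    # map phase: one map task per input split (M = len(splits)); each task
    # buckets its output into R partitions on its "local disk"
    partitions = []
    for split in splits:
        parts = defaultdict(list)
        for k, v in map_fn(split):
            parts[hash(k) % R].append((k, v))   # default partitioner: hash(key) mod R
        partitions.append(parts)

    # reduce phase: reduce task r fetches partition r from every map task
    # (the RPC step), sorts by key, groups, and applies the reduce function
    outputs = []
    for r in range(R):
        pairs = sorted((kv for parts in partitions for kv in parts[r]),
                       key=lambda kv: kv[0])    # sort: a partition holds many keys
        outputs.append({k: reduce_fn(k, [v for _, v in grp])
                        for k, grp in groupby(pairs, key=lambda kv: kv[0])})
    return outputs                              # one output "file" per reduce task

# WordCount over two splits with R = 2:
wc_map = lambda text: [(w, 1) for w in text.split()]
wc_reduce = lambda key, values: sum(values)
print(run_job(["a b a", "b c"], wc_map, wc_reduce, R=2))
```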