Tasks in MapReduce are designed to be small. The granularity of the task sizes is such that there is not much variation in the work required for tasks in the same category. To better understand this, we need a basic idea of the kinds of applications MapReduce is most useful for. Typically, MapReduce jobs are data intensive. As noted in the MapReduce paper, the nature of the computation is usually simple; what makes the job complex is the sheer size of the dataset and the fact that, to complete the job in a reasonable amount of time, it has to be parallelized. With this basic premise set, it is easy to see that the amount of work needed for a task is more or less proportional to the size of the dataset on which it has to perform the computation. Thus, the smaller the dataset (within certain bounds, such as block sizes on disk), the smaller the expected amount of work required for the task. Reducing the size of each task's dataset would therefore also, in general, lower the variance among the tasks in a given category.

MapReduce further takes the location information of the input files into account, thereby ensuring that tasks run on nodes that provide easier access to the chunk of the dataset to be worked on. This greedy policy tries to keep locality from becoming a significant variable on which the running time of a task depends, making tasks more uniform and symmetric and so reducing the variance in their individual running times (within a given category).

The problem with jobs with high variance is that it is difficult to determine which tasks are stragglers if we restrict ourselves to a metric that averages over all running tasks. What we want is for tasks to be analyzed according to the *kind* of task each one is. MapReduce already has a way to handle this situation: it groups tasks into categories such as map and reduce. The problem arises when tasks within a category have significant variance among them. An easy solution (conceptually speaking) would be to provide an interface through which programmers can give hints about the expected work involved in each task by grouping tasks into finer-grained groups. Stragglers would then be determined by comparing against metrics averaged across these groups rather than across coarse categories such as map and reduce.
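To make the idea concrete, here is a minimal sketch (in Python, with hypothetical field names, not Hadoop's actual API) of straggler detection that compares each task against the mean progress rate of its programmer-supplied group rather than against its whole category:

```python
from collections import defaultdict

STRAGGLER_FACTOR = 0.5  # assumed: flag tasks running at < 50% of their group's mean rate

def find_stragglers(tasks):
    """tasks: objects with .group (programmer hint), .progress_score in [0, 1],
    and .elapsed (seconds since launch); all names are illustrative."""
    rates_by_group = defaultdict(list)
    for t in tasks:
        rate = t.progress_score / t.elapsed if t.elapsed > 0 else 0.0
        rates_by_group[t.group].append((t, rate))

    stragglers = []
    for group, pairs in rates_by_group.items():
        mean_rate = sum(r for _, r in pairs) / len(pairs)
        stragglers += [t for t, r in pairs if r < STRAGGLER_FACTOR * mean_rate]
    return stragglers
```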

Next

The assumption that tasks in the same category should take equal time is an assumption of the paradigm. MapReduce expects the programmer to break the work down into small, equal chunks that translate to equal amounts of work. This assumption helps the detection of stragglers using the progress rate metric, which breaks down when tasks are unequal or the underlying nodes are heterogeneous. The original MapReduce also executes redundant copies of tasks to ensure that heterogeneity of nodes does not violate this assumption, and it launches redundant tasks when it detects a straggler based on the metric described above.

The LATE scheduler, instead of choosing stragglers based on progress rate, launches speculative copies based on the estimated time remaining, and only on fast nodes. The system also uses caps to avoid overloading itself. This way, speculation and redundant execution are more accurate and do not waste cycles.
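A minimal sketch of that heuristic as described above (not Hadoop's actual code; field names are assumed): estimate each running task's remaining time from its progress so far, then speculate the task expected to finish farthest in the future.

```python
def estimated_time_left(progress_score: float, elapsed: float) -> float:
    """LATE-style estimate: (1 - ProgressScore) / ProgressRate, where
    ProgressRate = ProgressScore / elapsed; names here are illustrative."""
    rate = progress_score / elapsed if elapsed > 0 else 0.0
    return (1 - progress_score) / rate if rate > 0 else float("inf")

# e.g. speculate max(candidates, key=lambda t: estimated_time_left(t.progress_score, t.elapsed)),
# but only when the requesting node is fast and the speculation cap is not yet reached.
```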

One way to handle high variance in work is to break tasks into smaller subtasks where possible. Very short tasks do not let a machine sustain a high average throughput, while a very large task can cause a convoy effect, so a middle ground needs to be reached. It can also help to execute jobs with similar task sizes at the same time, so that the speculator does not play havoc while detecting stragglers. Another solution is for the programmer to annotate tasks with more work, so that the scheduler can execute them on fast machines only.

Next

With MapReduce, there are M runs of the map operation and R runs of the reduce operation. The MapReduce library splits the map input evenly into M parts and feeds them to the different map runs. Likewise, the library splits the reduce input into R parts and feeds them to the different reduce runs. Thus each map task does approximately the same amount of work, and each reduce task does approximately the same amount of work.

The LATE scheduler is an enhancement to the handling of straggler tasks in the Hadoop scheduler. LATE does two things of particular interest to improve performance: 1. It develops a more intelligent algorithm to identify straggler tasks, and thereby avoids unnecessary speculative execution and attains targeted speculative execution. 2. It takes node heterogeneity into account when assigning speculative executions, and ensures that speculative execution is done only on fast nodes.

Currently, LATE does not consider the locality of the required input data when scheduling speculative execution of tasks; it only looks for fast machines. This is one potential front on which the LATE scheduler could be enhanced to further improve its handling of straggler tasks; doing so can often save expensive network copy overheads.

Next

In a well-behaved, typical MapReduce job, the input will be separated into roughly equal chunks for map workers, and the key space will be divided roughly equally among reducers, resulting in the same amount of work within each category.

If some tasks have more work, then a few extra speculative tasks will be launched. The scheduler does not know whether the root cause is a slower machine or a longer task. LATE limits the number of speculative tasks, so it can handle this situation with some performance penalty.

When the workload is divided and assigned to workers, the scheduler can control the number of workers each task uses based on the estimated time the task needs. In heterogeneous environments, the scheduler can assign the longer tasks to faster nodes. The scheduler still needs to try to make all tasks proceed at a similar rate; otherwise, the response time will be bad. No matter what the environment is, the scheduler needs a better metric that takes the different lengths of the tasks into account. At the least, the length of the workload should appear in the metric, probably in the denominator when computing the progress rate.
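One way to read this suggestion is to compare tasks by absolute throughput (work processed per unit time) rather than by the fraction of the task completed per unit time, so that a long task on a healthy node is not mistaken for a straggler. A rough sketch, with hypothetical field names:

```python
def throughput(task):
    """Absolute work per second; task.bytes_processed and task.elapsed are assumed fields."""
    return task.bytes_processed / task.elapsed if task.elapsed > 0 else 0.0

def is_straggler(task, all_tasks, factor=0.5):
    # Slow in absolute terms, not merely "big": compare against the mean throughput.
    mean = sum(throughput(t) for t in all_tasks) / len(all_tasks)
    return throughput(task) < factor * mean
```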

We can also add layers. We can add one layer that consolidates several nodes into one logical node and assigns tasks accordingly. We can also add another layer on top of the basic MapReduce framework (which assumes the same amount of work per task); this additional layer would try to ensure that the workloads each final worker gets have the same length.

Next

To get relatively well-balanced work, MapReduce tries to partition the work into many small tasks. This way, each Map or Reduce worker can perform several tasks, which helps achieve load balancing because no worker gets a single huge task. As mentioned in the MapReduce paper, splitting the work into many small partitions also helps failure recovery.

The LATE scheduler does not directly address this assumption; it assumes that the separation function for either Map or Reduce work splits the input into roughly equal sizes. Consider the case where this assumption is not true. Then the LATE scheduler simply considers the worker with more work to be a straggler and starts speculative tasks on other fast nodes. As mentioned in the paper, launching a few extra speculative tasks is not harmful as long as obvious stragglers are also detected.

There is no good way to solve the load-balancing problem when the initial input partition is not well balanced, except to repartition the work of the node that has more work left to do. When the scheduler detects a node that is not executing slowly but still has a lot of work left, it can repartition that node's work and send the new partitions to other nodes. Either we need to do job migration or speculatively execute each repartitioned piece on a specific node. In all, the main idea is to repartition the work.

Next

MapReduce also relies on the assumption that tasks in the same category require roughly the same amount of work, since:

1. The input is divided into equal chunks that are read by the workers running the map tasks. Given a homogeneous set of machines, the map tasks will take about the same time on average (the map machines are selected with locality of data in mind).

2. The data produced by the map tasks is partitioned with the help of a hash function. Given the nature of hash functions in general, we can assume they produce a roughly uniform distribution over the key space (a rough sketch follows this list).
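For illustration, a partition function in the spirit of the paper's hash(key) mod R; zlib.crc32 here is just a deterministic stand-in hash, and any reasonable hash spreads the intermediate keys roughly uniformly across the R reduce partitions:

```python
import zlib

def partition(key: str, R: int) -> int:
    # Deterministic hash of the key, reduced modulo the number of reduce tasks.
    return zlib.crc32(key.encode("utf-8")) % R
```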

LATE scheduler: Tasks could require greater work/time to complete when (a) the nature of the input in that chunk triggers a path in the code that is inherently slow, or (b) a faulty machine or background tasks contend for resources. Thus the LATE scheduler would likely try to reschedule the "straggler" on a fast machine and hope it completes earlier. The scheduler has no idea about the map or reduce functions or about the exact nature of the workload, so such anomalies can't be avoided.

The other aspect of LATE that differentiates its behavior in this scenario is that it measures the estimated time to completion, and this depends on the time required by completed tasks of the same category. So a task that takes more time than the average case could have its estimated time to completion biased by the time taken by other tasks.

Changes to the MapReduce architecture: Based on the time taken to complete tasks, the master process can figure out which worker processes are slow; this knowledge can be used to subdivide tasks and help expedite the job. A task could be divided into subtasks that could optionally be offloaded if the machine originally assigned to run it is too slow. The problem with this approach is that the Reduce phase copies the entire data to local disks and sorts it before applying the reduce function. To overcome this, the subdivided tasks would have to be merged on the basis of the sort keys, which could probably be implemented by another map-reduce cycle (if necessary, as it isn't always important to have sorted data).

Next

MapReduce tries to partition the input data and the intermediate keys into equal-sized pieces (within the same category, map or reduce) to uphold the assumption that tasks in the same category have the same amount of work. The partition function uses hashing on keys (or modified keys).

The LATE scheduler doesn't deal with the amount of work of tasks directly, since it uses the estimated time left for a task as the criterion for ranking tasks for speculative execution. However, its use of the SlowTaskThreshold helps prevent choosing tasks that have more work but run fast: if a task's progress rate is above SlowTaskThreshold, it won't be considered for speculation. So if the task with more work runs faster than SlowTaskThreshold, it won't be considered. However, if it runs slower than SlowTaskThreshold, it will be considered together with small tasks, which is not fair, since the time left for a big task can be much longer than that of small tasks not because it is running slower but because of the amount of work it has.

Instead of considering only the absolute estimated time left of tasks, I would add the consideration of task size (amount of work) into the ranking process of the scheduler and use the estimation of time left of tasks relative to task sizes. Specifically, we can divide the estimated time left ((1-ProgressScore)/ProgressRate) by the estimated amount of work left, e.g. (1-ProgressScore)/(ProgressRate*WorkLeft). The amount of work left can be estimated by observing the amount of work that has been performed and where the task is in its progress.
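A rough sketch of this size-relative ranking (field names are assumed, not Hadoop's API): the plain LATE estimate (1 - ProgressScore)/ProgressRate is divided by the estimated work left, so a big task is not ranked ahead of genuinely slow tasks merely because it is big.

```python
def relative_time_left(task):
    """task.progress_score in [0,1], task.elapsed in seconds, task.work_left in
    whatever work unit is convenient (e.g. bytes); all names are illustrative."""
    rate = task.progress_score / task.elapsed if task.elapsed > 0 else 0.0
    if rate == 0:
        return float("inf")   # no measurable progress yet
    if task.work_left == 0:
        return 0.0            # effectively finished
    return (1 - task.progress_score) / (rate * task.work_left)

# Candidates for speculation would then be ranked by this value, largest first:
# candidates.sort(key=relative_time_left, reverse=True)
```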

Next

A MapReduce job ensures that tasks in the same category require roughly the same amount of work. This is done by separating the input into equal chunks and by dividing the key space among reducers. Since LATE focuses on the estimated time left rather than just the progress rate, it speculatively executes the task that is estimated to finish farthest in the future. Hence, the assumption that all tasks do the same amount of work is inherently absent in the LATE scheduler, because it captures the estimated finish time of all tasks and prioritizes their speculation accordingly.

The MapReduce framework divides the input into M chunks (specifically M files) for the Map phase. Similarly, for the Reduce phase, the intermediate buffered output is partitioned into R regions. However, this partitioning is completely independent of the logic of the map and reduce programs. The amount of work done by a map or reduce task is not just a function of the size of the input chunk (as assumed by the MapReduce framework); it also depends on the relationship between the "content" of the chunk and the "logic" of the map/reduce task. Therefore, to make MapReduce sensitive to this relationship, it could learn from the chunks being processed and predict the relationship for future chunks. If processing a particular chunk is predicted to take more or less time, the chunk can be split further or merged accordingly, to ensure that all tasks do about the same amount of work.
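A hedged sketch of that idea (all names are hypothetical): learn an average processing cost per byte from chunks that have already completed, and size future chunks so that each task aims for roughly the same wall-clock time.

```python
class AdaptiveSplitter:
    def __init__(self, target_seconds: float, default_chunk: int = 64 << 20):
        self.target = target_seconds          # desired wall-clock time per task
        self.default_chunk = default_chunk    # fall back to ~64 MB until we have data
        self.samples = []                     # (bytes, seconds) for finished chunks

    def record(self, nbytes: int, seconds: float):
        self.samples.append((nbytes, seconds))

    def next_chunk_size(self) -> int:
        if not self.samples:
            return self.default_chunk
        cost_per_byte = sum(s for _, s in self.samples) / sum(b for b, _ in self.samples)
        return max(1, int(self.target / cost_per_byte))
```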

Next

MapReduce breaks the work into chunks of a similar and predictable size (16-64MB). This is the size of a chunk in GFS, so it is possible to schedule a Map function on a node that holds the required chunk. If this is possible, then no network delays or failures external to the node running Map will affect the computation. Beyond that, the Google cluster is virtually homogeneous, so the assignment of a task to a particular node is inconsequential aside from the locality of the files required. Lastly, as the MapReduce job nears completion, it spawns speculative tasks on the off chance that one of the running map tasks is a straggler.

Zaharia et al. claim that Hadoop (an openly usable MapReduce implementation) thinks too little about when and where to spawn backup processes, and that this comes back as a performance hit when the environment is heterogeneous in individual node performance. In particular, one might spawn backup tasks for a set of jobs that will not turn out to be the stragglers. LATE adds more logic to the backup decision: it seeks the node that is making the least progress (measuring progress over many tasks) and the task that has the most work left to do. A task may be slow for one of two reasons. First, it may be on a slow node; this is the problem LATE seeks to address directly. Alternatively, the node may be running a disproportionately long task. LATE serves this case too, but only as a byproduct of that task taking longer and thereby satisfying the time-remaining heuristic.

If the amount of work in a task is truly unpredictable, having a heuristic such as LATE's is really the only way to go, so it seems. Once a long task is detected, it can be moved. If, on the other hand, there exists some heuristic by which we may guess at slow tasks before they are realized, then we can assign those tasks to the fastest nodes. With this we must add additional logic, too, to avoid misconstruing these fast nodes to be broken or stragglers if the work still takes too long. As long as these tasks are identified and assigned earlier than comparatively fast tasks, the length of any one MapReduce's "straggle" will be minimal.

Next

The MapReduce library splits the input data into M similar-sized pieces (where M is the number of map tasks). These M tasks then process (map) the input data into R partitions (where R is the number of reduce tasks). The partitioning can be controlled by the user application, but basically it is done so that "similar" data goes into the same partition (to be processed by the same Reduce task). This way, the amount of work (basically, data) is kept about the same for each of the map tasks and each of the reduce tasks.

The LATE scheduler gives special treatment to tasks that are slow. To nodes that are themselves slow, the LATE scheduler does not assign speculative work (based on thresholds). This helps the slower tasks get resources more exclusively and complete faster (if possible).

One modification to the algorithm could be for tasks to provide an estimate to the master right after being assigned work. The workers know more about their environment (system type, load, etc.), so they have a better sense of when they are going to complete the job. The master can then use this information to (1) reschedule the task immediately if it is going to end too late, or (2) track the progress of the worker against the estimate it provided.

Next

MapReduce is a framework for computation that was developed to simplify the development of parallel programs by hiding the vast majority of common work, such as dealing with unreliability, while enabling high performance. By relieving the programmer of such implementation details, this widely applicable framework allows for quick and (relatively) easy development. Since this framework is used as the basis for many different programs, ensuring its performance is very important. It has been shown that the scheduling of tasks can have a large influence on completion time, and it has further been found that the default scheduler does not perform well under heterogeneous environments and other specific conditions.

Hadoop, a prominent implementation of MapReduce, tries to ensure that no tasks limit the performance of the overall job by re-executing tasks in the hope that they will execute more rapidly elsewhere on the system. The method by which task re-execution is triggered is based on a progress score, which is essentially a calculation of how much work a task has already accomplished. There are a number of phases through which a task must pass, and each phase accounts for a fixed portion of the progress score. This method generally works well for Hadoop because a homogeneous environment is assumed and all phases generally take nearly the same amount of time: care is taken to ensure that all phases run at full speed by splitting the work into equally sized chunks and by employing locality to ensure that all threads have similar access times to data.

The LATE scheduler developed by Zaharia et al. notes that the Hadoop scheduler does not perform well in heterogeneous environments. They do not directly address the issue of tasks with different amounts of work, but by addressing the issue of different machines having different processing speeds, they manage to deal with the issue of differing amounts of work as well. Basically, the LATE scheduler replaces the progress score with an estimate of a task's finish time and worries about re-executing the tasks whose completion time would have a severe negative impact on overall performance. A task that has twice the work of all others, everything else being constant, will have a finish time roughly twice that of the shorter tasks. Thus, if the finish time is correctly calculated, the task will be correctly identified and scheduled appropriately.

Next

To ensure the assumption, MapReduce typically relies on two points:

1. MapReduce assumes that each MapReduce job is well behaved: the operation density across all the data is uniformly distributed.

2. The splitting program divides the data into chunks of equal size.

The metric the LATE algorithm uses is the estimated time to completion. If a task that requires more work is running on a fast machine, it may still be scheduled for speculative execution. However, this does not violate their goal of minimizing the average response time. In this sense, I would argue that their algorithm is more sensitive to the non-linearity in the operation requirements of a task over time.

One possible way is to associate the data with the work it requires. The original MapReduce assumes that the work on the data is equally distributed, but a weighting factor could be added. The splitting program, instead of dividing the data equally by size, could take the weight into account. Another way to improve average response time would be to consider the profile of each machine and migrate the heavy jobs onto the faster machines using live migration, but this would be much more complicated.

Next

To hold the assumption that tasks in the same category (map or reduce) require roughly the same amount of work, MapReduce keeps tasks small. The separation of input into equal chunks and the division of the key space among reducers ensures roughly equal amounts of work.

The LATE scheduler does not directly address this assumption. For tasks with more work, LATE estimates the time to completion as (1 - ProgressScore)/ProgressRate to rank the possible stragglers.

To better handle jobs with high variance in work across tasks, we should use the ProgressRatio, instead of the estimated time left, to rank the possible stragglers. For example, consider two tasks: one has a total time of 10ms and has completed 5ms; the other has a total time of 5ms and has completed 1ms. Using LATE, the first one will be chosen for re-execution, since it has more time left until completion (5ms > 4ms), even though it is simply a bigger task. Considering the ProgressRatio instead, the first task has another 50% of its work to do while the second has 80% to do, so we would choose the second one (80% > 50%).

The revised algorithm works as follows:

If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running:

- Ignore the request if the node's total progress is below SlowNodeThreshold.

- Rank currently running tasks that are not currently being speculated by estimated ratio left: 1 - ProgressScore.

- Launch a copy of the highest ranked task with progress rate below SlowTaskThreshold.
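A sketch of this revised selection (simplified; thresholds and field names are assumptions, not the paper's actual values or Hadoop's API):

```python
SPECULATIVE_CAP = 10         # assumed cap on concurrently speculated tasks
SLOW_NODE_THRESHOLD = 0.25   # assumed node-progress percentile below which no speculation is given
SLOW_TASK_THRESHOLD = 0.25   # assumed progress-rate percentile a task must be under

def pick_speculative_task(node_percentile, running_tasks, num_speculating):
    """node_percentile: requesting node's rank by total progress (0 = slowest, 1 = fastest)."""
    if num_speculating >= SPECULATIVE_CAP or node_percentile < SLOW_NODE_THRESHOLD:
        return None
    candidates = [t for t in running_tasks if not t.speculated and t.elapsed > 0]
    if not candidates:
        return None
    # Rank by the estimated ratio of work left: 1 - ProgressScore, largest first.
    candidates.sort(key=lambda t: 1 - t.progress_score, reverse=True)
    # Progress-rate cutoff corresponding to SlowTaskThreshold.
    rates = sorted(t.progress_score / t.elapsed for t in candidates)
    cutoff = rates[int(SLOW_TASK_THRESHOLD * (len(rates) - 1))]
    for t in candidates:
        if t.progress_score / t.elapsed <= cutoff:
            return t  # highest-ranked task whose progress rate is below the threshold
    return None
```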

One problem is that the ProgressScore is not accurate: it divides a reduce task into three phases with equal progress weight. Given a workload, it may be possible to use a machine-learning approach to find a more appropriate way to compute the ProgressScore.

Next

In order to ensure the assumption holds, during the map phase the input data is usually divided into equal-sized chunks; this fits the typical work carried out by MapReduce, so each map task requires the same amount of work. In the reduce phase, the developer often provides a partitioning function that has a priori knowledge of the intermediate key space, thus dividing the workload into load-balanced pieces.

In the Hadoop framework with the LATE scheduler, this assumption still holds, so a task with more work may lead to two situations. On the one hand, this task takes more time than normal tasks, so its progress rate will be low; in an extreme case, the progress rate may be below SlowTaskThreshold and the task will be launched speculatively on another node. On the other hand, since the task takes so much time, the total progress of the node that has carried (or is carrying) the task may fall below SlowNodeThreshold, in which case this node would be considered a straggler, even if it is actually a fast machine.

To handle this problem, the key point is to differentiate between tasks with more work and tasks running on stragglers. One possible way to do this is to record on the master node the average time of all succeeded tasks in the system (global) and the average time of tasks finished on each node (per node). When a task is running on a node and its progress rate is very low, we can tell whether the node is a straggler or the task simply requires more work by looking at the average time values recorded on the master. If this node's average is consistently higher than the global average, the node is probably slow. Otherwise, it is likely that the task really has more work to do, and we should not run it speculatively elsewhere.
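A sketch of that bookkeeping (names are illustrative, not Hadoop's API): the master records completion times globally and per node, and compares them to decide whether a slow-looking task sits on a slow node or simply has more work.

```python
class CompletionStats:
    def __init__(self):
        self.global_times = []   # completion times of all succeeded tasks
        self.node_times = {}     # node id -> completion times on that node

    def record(self, node, seconds):
        self.global_times.append(seconds)
        self.node_times.setdefault(node, []).append(seconds)

    def node_is_slow(self, node, slack=1.2):
        """True if this node's tasks have taken notably longer than the global average."""
        times = self.node_times.get(node)
        if not times or not self.global_times:
            return False  # not enough history to judge
        node_avg = sum(times) / len(times)
        global_avg = sum(self.global_times) / len(self.global_times)
        return node_avg > slack * global_avg

# Usage idea: if a task's progress rate is low but node_is_slow(node) is False,
# assume the task simply has more work and skip speculating it.
```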

Next

First, for the Map category the assumption is ensured by dividing the input into relatively small and equally sized chunks of data. Then, since the map function is the same everywhere and disk access is mostly local (reducing network access factors), the average time will be mostly equal.

For the Reduce category, ideally all tasks go through the same phases: fetching the intermediate data and sorting it, where some variance in time can occur given a variance in the size of the input data, though it remains relatively homogeneous; and the reduce phase, where the function is the same for all nodes but the execution time varies subtly from one input to another, again remaining relatively homogeneous.

The LATE scheduler does not directly address this assumption. How does the LATE scheduler handle tasks with more work? How could you modify the scheduler (or any aspect of the MapReduce framework) to better handle jobs with high variance in work across tasks?

A larger task will tend to take more time than the rest to process, so it may be tagged as a candidate for speculation. By assumption it is being processed on a non-straggler node, so if the task is speculated it results in "wasted time", since the node is performing well. On the positive side, it wouldn't be wrongly scheduled twice, since the algorithm doesn't speculate a task that has already been speculated.

One way to improve the handling of unevenly sized data would be between the Map and Reduce phases, where the data size is most likely to vary. The intermediate data size could have a threshold: data above that value is split to be processed by two or more nodes, with some signal given so the data can be merged or appended into one output file. The threshold value could be calculated by the master based on the intermediate data available. Of course, this produces only a rough value, in order to give quick responses as intended by the LATE algorithm.

Next

How does a MapReduce job typically try to ensure this assumption holds true?

A MapReduce job primarily relies on the data-splitting function, which splits the input into equal chunks of work (for map tasks), and the partitioning function, which partitions the intermediate key space into equal chunks of work (for reduce tasks). Also, there are combiner functions that combine many repetitions of an intermediate key from one map task into a single intermediate record. Though combiner functions strive to equalize the reduce tasks, they make the map tasks heavier. So combiner functions are a tradeoff between compute resources in the map workers and the network bandwidth between map and reduce workers plus compute resources in the reduce workers.

How does the LATE scheduler handle tasks with more work?

The LATE scheduler monitors the rate of completion of tasks. Bigger tasks on small workers will tend to complete slowly and therefore become late, and the LATE scheduler will then schedule a speculative backup task on another, bigger worker that is available.

How could you modify the scheduler (or any aspect of the MapReduce framework) to better handle jobs with high variance in work across tasks?

I am addressing the above question and also exploring how high variance can be introduced intentionally into the tasks so that the heterogeneity of the workers can be used beneficially.

I think the paper addresses the problem from the wrong perspective. The solution given in the paper, in a crude sense, logically removes the small workers from the MapReduce cluster: the small workers will be late and their work will be handed to a bigger worker, rendering the small workers' work useless. So, from a broad perspective, the small workers' contribution is wasted.

However, the natural way of addressing heterogeneity is to make the small workers contribute usefully to the MapReduce job. My suggestion is to throw away the MapReduce paradigm's assumption of always splitting the tasks into equal sizes, and to use the variance in the tasks as an opportunity to address heterogeneity in the workers.

My assumptions:

a) There is a way to gauge the task sizes. This is a reasonable assumption, since the properties of the splitting and partitioning functions will be well known in most cases, and hence the generated task sizes can be measured.

b) The load on the workers in the farm needs to be constantly made available to the master. This is also a reasonable assumption when compared to the existing need for the master to always know the task progress.

Now, given that the scheduler knows the number and sizes of the tasks along with the load of the workers, the scheduler should try to allocate tasks to workers in proportion to their capacity, taking their current load into account (a rough sketch follows). This way the heterogeneity will be used beneficially. Since we have this approach, can we force variance in tasks somehow? For map tasks, we can study how the splitting functions can be changed to achieve this. For reduce tasks, we can increase the value of R (the MapReduce parameter deciding the number of reduce tasks) to be a larger multiple of the number of reduce workers. This way there will be plenty of smaller reduce tasks, and they can be spread across the heterogeneous workers easily by combining one or more of these smaller reduce tasks appropriately.
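A hedged sketch of that proportional assignment (all names are hypothetical): tasks of known size are assigned greedily, largest first, to whichever worker would finish them earliest given its speed and the work it already holds.

```python
import heapq

def assign(task_sizes, worker_speeds):
    """task_sizes: work units per task; worker_speeds: work units per second per worker.
    Returns a dict {worker index: [task indexes]}."""
    # Heap of (estimated finish time of the work already assigned, worker index).
    heap = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(heap)
    plan = {w: [] for w in range(len(worker_speeds))}
    # Largest tasks first, so big tasks naturally land on fast or lightly loaded workers.
    for i in sorted(range(len(task_sizes)), key=lambda i: -task_sizes[i]):
        finish, w = heapq.heappop(heap)
        plan[w].append(i)
        heapq.heappush(heap, (finish + task_sizes[i] / worker_speeds[w], w))
    return plan

# Example: assign([8, 4, 4, 2, 1], [2.0, 1.0]) places the biggest task on the 2x-faster worker.
```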