Parallelism in Workloads

Experiments

In this set of experiments, we quantify the amount of parallelism that exists in our workloads. We determine the optimal speed-up of each workload, assuming that it is given either 1, 2, 4, 8, or 16 workstations. We also determine the slowdown, relative to this optimal case, that occurs when a first-come, first-served (FCFS) allocation scheme is used instead.

Workload scripts explicitly denote which jobs may be run in parallel. Our semantics are that all jobs in a parallel group must complete before the next parallel group can begin. Therefore, given an infinite number of processors, we can calculate the run time of the entire script by summing the run time of the longest-running job in each parallel group.
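For concreteness, the minimal sketch below (in Python) computes the run time of a script under these semantics when workstations are unlimited, along with the resulting best-possible speed-up. The script layout and run times are invented for illustration; they do not come from our workloads.

    # A workload script represented as a list of parallel groups, where each
    # group is a list of job run times in seconds (hypothetical values).
    script = [
        [120.0, 45.0, 30.0],   # parallel group 1: all three jobs may run concurrently
        [200.0],               # parallel group 2: cannot start until group 1 finishes
        [60.0, 55.0],          # parallel group 3
    ]

    # With unlimited workstations, each group takes as long as its longest job,
    # and the groups run back-to-back.
    ideal_runtime = sum(max(group) for group in script)

    # The sequential run time is every job executed one after another.
    sequential_runtime = sum(sum(group) for group in script)

    print(f"ideal runtime: {ideal_runtime}s, "
          f"best possible speed-up: {sequential_runtime / ideal_runtime:.2f}")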

Optimal Parallelism

To determine the optimal speed-up, we assume that we know the execution time of each application in advance. We use the execution time measured when the application runs sequentially; this is optimistic, since a parallel execution may not exploit file cache locality (the effect of a flushed file cache is explored in the next section). To determine the total execution time of a script on n processors, we sort the run times of the applications in each parallel group in decreasing order and assign the longest remaining job to each workstation as it becomes idle. This keeps the work load-balanced across processors.
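The sketch below illustrates this longest-job-first assignment. It assumes the same list-of-parallel-groups representation as the earlier example, and the function names and run times are made up for illustration.

    import heapq

    def group_finish_time(run_times, n):
        """Finish time of one parallel group on n workstations, assigning the
        longest remaining job to whichever workstation becomes idle first."""
        idle_at = [0.0] * n                    # time at which each workstation next becomes idle
        heapq.heapify(idle_at)
        for t in sorted(run_times, reverse=True):
            earliest = heapq.heappop(idle_at)  # workstation that frees up first
            heapq.heappush(idle_at, earliest + t)
        return max(idle_at)

    def script_runtime(script, n):
        """Total run time of a script: parallel groups execute one after another."""
        return sum(group_finish_time(group, n) for group in script)

    # Hypothetical script; speed-up is measured relative to one workstation.
    script = [[120.0, 45.0, 30.0], [200.0], [60.0, 55.0]]
    base = script_runtime(script, 1)
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} workstations: speed-up {base / script_runtime(script, n):.2f}")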

CPU Workload

The three graphs in this section show the speed-up for the CPU Workload with 1, 2, 4, 8, and 16 workstations. (Note that the scale on each graph is different.) All of the applications in the CPU Workload may be run in parallel; no dependencies exist. However, the ability of each workload to take advantage of additional workstations varies widely.

Many of the workloads (e.g., spice, mdljdp2, hydro2, ear, ora, and swm256) consist of only two or three jobs, and in many cases one of the jobs is significantly longer than the others. Therefore, these workloads cannot effectively use more than two workstations, and even then see speed-ups of less than 1.5. The other end of the spectrum is exemplified by the workloads in the last graph, tomcatv and doduc; these consist of many job runs, with enough short jobs to load balance the system.

I/O Workload

The I/O Workload contains less parallelism than the CPU Workload because dependencies exist between many of the applications. The two exceptions are the runs of sim, which have no dependencies; for these, we see best-case speed-ups of 5 (the smaller runs) and 3 (the longer runs).

Development Workload

Finally, the Development Workload is able to take advantage of parallelism when compiling a large number of files. (Note that not all development workloads are shown; we did not explicitly expose the parallelism in the makes for some of the workloads.)

FCFS Allocation

We also examine the effect of assigning jobs to workstations in the order that the user submitted them to the system. In these graphs, we look only at the CPU Workload.
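The sketch below contrasts this FCFS hand-out with the longest-job-first assignment used for the optimal case; the group of run times is hypothetical and deliberately chosen so that the longest job is submitted last, which is the situation that penalizes FCFS.

    import heapq

    def fcfs_group_finish_time(run_times, n):
        """Finish time of one parallel group when jobs are handed out in the
        order the user submitted them, rather than longest-first."""
        idle_at = [0.0] * n
        heapq.heapify(idle_at)
        for t in run_times:                    # submission order, no sorting
            earliest = heapq.heappop(idle_at)  # workstation that frees up first
            heapq.heappush(idle_at, earliest + t)
        return max(idle_at)

    # Hypothetical group: the long job is submitted last, so on two
    # workstations FCFS finishes at 150s, while the longest-first
    # assignment would finish at 120s.
    group = [30.0, 45.0, 120.0]
    print("FCFS on 2 workstations:", fcfs_group_finish_time(group, 2))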

Clearly, workloads that saw no advantage from optimal parallelism perform no worse when a less-than-optimal allocation strategy is used. However, workloads such as su2cor, mdljsp2, mdljdp2, and polmp, which contain between 3 and 6 jobs and saw speed-ups between 1.5 and 2.5, are sensitive to the allocation strategy when there are only two workstations. This sensitivity occurs because the run times of their jobs vary widely, and the order in which the user submitted the jobs does not balance the load. As we increase to 4 workstations, the allocation becomes more load-balanced; when as many workstations are available as parallel jobs, the FCFS allocation strategy is identical to the optimal case. The doduc workload is the only one that sees a difference in performance between the FCFS and optimal policies with a large number of workstations, due to its very large number of runs.

Conclusions

The amount of parallelism in each workload varies widely; the number of jobs that can be run in parallel varies between 1 and 34 (doduc). However, because of widely varying run times, load balancing the jobs is difficult; speed-ups greater than 6 are rare, regardless of the number of available processors. Since users requesting several processors see diminishing returns, when resources are scarce no single user should be allowed to use many of the workstations if overall efficiency is the goal.

FCFS is a reasonable allocation strategy in most cases; however, in a few cases, slowdowns near 20% were observed when only two workstations were used. Increasing the number of workstations to four eliminates most differences between the optimal and FCFS allocation.
