Abstract: Many distributed algorithms are limited by the performance
of the slowest machine in the system. For instance, if an algorithm
requires all processes to reach a barrier before it can continue, then
the progress of the whole is determined by the last process to reach that
barrier. One possible solution to this problem would be to reduce
the load on the slower machines by giving some of their work to the faster
ones. In practice, however, this is very difficult to implement.
Deciding how to distribute the load often requires global knowledge, direct
supervision by an omniscient machine, and heavy calculation. Furthermore,
if we do decide to take a job from a slow machine and give it to a fast
machine, we must first guarantee that the latter machine has all of the
necessary resources (data, information, communication ability, etc.) to
complete the job.
This paper examines a specific setting where workload redistribution
is desirable, describes a simple solution, and analyses its effectiveness.
The setting is a large parallel database in which 100 client machines work
concurrently to read a file that is stored across 100 servers. The
solution requires only local communication (each server must communicate
with two clients, and each client with two servers), and only very minor
calculations. It adapts dynamically over time and can be shown to
converge to the best possible solution. We present it here in several
variations, and discuss the strengths and weaknesses of each.
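The paper's scheme itself is not reproduced in this abstract, but the idea of purely local rebalancing can be illustrated with a minimal sketch. The following toy simulation assumes a ring of workers in which each worker exchanges load only with its two neighbours, shifting a fraction of the imbalance in estimated finish time each round; the function names, the diffusive update rule, and the `alpha` damping parameter are our own illustrative choices, not the paper's algorithm.

```python
# Hypothetical sketch: diffusive load balancing on a ring of workers.
# Each worker talks only to its two neighbours and moves a fraction of
# the finish-time imbalance toward the side that would finish sooner.
# All names and parameters here are illustrative, not from the paper.

def balance_step(load, speed, alpha=0.25):
    """One round of local exchange. Worker i compares its estimated
    finish time (load/speed) with its right neighbour and shifts a
    damped fraction of the difference across the shared boundary."""
    n = len(load)
    delta = [0.0] * n
    for i in range(n):
        j = (i + 1) % n  # right neighbour on the ring
        ti, tj = load[i] / speed[i], load[j] / speed[j]
        # positive move: i finishes later, so it gives work to j
        move = alpha * (ti - tj) * min(speed[i], speed[j])
        delta[i] -= move
        delta[j] += move
    return [max(0.0, l + d) for l, d in zip(load, delta)]

def simulate(load, speed, rounds=200):
    """Iterate local exchanges; finish times equalise over time."""
    for _ in range(rounds):
        load = balance_step(load, speed)
    return load

# Four workers with equal load, one twice as fast: the fast worker
# ends up with roughly twice the work, so finish times converge.
speed = [1.0, 1.0, 2.0, 1.0]
load = simulate([25.0, 25.0, 25.0, 25.0], speed)
times = [l / s for l, s in zip(load, speed)]
```

The update is stable because each worker's finish time moves by at most half the gap to its neighbours per round, so the spread of finish times shrinks geometrically; this mirrors, in spirit, the abstract's claim that only local communication and minor calculation are needed.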
Available as Maple Worksheet (This is more fun and interactive, but not easily printable. Open it in xmaple if you get a chance.)
Also available as: Postscript