Fast Parallel Sorting Under LogP: Experience with the CM-5
Fast Parallel Sorting Under LogP: Experience with the CM-5
In this paper, we analyze four parallel sorting algorithms (bitonic,
column, radix, and sample sort) with the LogP model. LogP characterizes
the performance of modern parallel machines with a small set of parameters:
the communication latency (L), overhead (o), bandwidth (g), and the
number of processors (P). We develop implementations of these algorithms
in Split-C, a parallel extension to C, and compare the performance
predicted by LogP to actual performance on a CM-5 of 32 to 512 processors
for a range of problem sizes. We evaluate the robustness of the algorithms
by varying the distribution and ordering of the key values. We also
briefly examine the sensitivity of the algorithms to the communication parameters.
We show that the LogP model is a valuable guide in the development of
parallel algorithms and a good predictor of implementation performance.
The model encourages the use of data layouts which minimize communication
and balanced communication schedules which avoid contention. With an
empirical model of local processor performance, LogP predictions closely
match observed execution times on uniformly distributed keys across a broad
range of problem and machine sizes. We find that communication performance
is oblivious to the distribution of the key values, whereas the local
processor performance is not; some communication phases are sensitive to
the ordering of keys due to contention. Finally, our analysis shows that
overhead is the most critical communication parameter in the sorting algorithms.