Section#15: Introduction to parallel sorting on mesh-based topologies
(CS838: Topics in parallel computing, CS1221, Tue, Mar 23, 1999, 8:00-9:15 a.m.)


The contents

  1. Introduction to comparison-based sorting
    1. Sorting networks
    2. Direct sorting
    3. Oblivious sorting and 0-1 lemma
  2. Mesh-based oblivious sorting
    1. Even-odd transposition sorting on 1-D meshes
    2. ShearSort on 2-D meshes
    3. Lower bound on 2-D mesh snakelike sorting
    4. Lexicographic sorting on 3-D meshes

Back to the beginning of the page Back to the CS838 class schedule

Introduction to comparison-based sorting

In Sections 15--18, we will discuss parallel deterministic sorting algorithms based on the compare-and-exchange (CE) operation. N will always denote the length of input data sequence. Sorting is one of fundamental problems in computer science. The lower bound on the number of CE operations to sort N elements is \Omega(Nlog N). Several optimal sequential algorithms are known, such as QuickSort or HeapSort.

A vast number of parallel sorting algorithms have been described in the literature and the state-of-the-art is briefly the following:

Parallel CE sorting algorithms are either PRAM or distributed memory ones. The latter can be implemented either as sorting networks or directly by imposing ordering on the nodes of an interconnection network and by sorting the numbers in the order complying to the numbering of nodes. This is easy if the topology allows hamiltonian path.

Sorting networks

CAPTION: Architecture of comparison-exchange sorting networks. (a) The default type of comparator (b) The second type of comparator. (c) Sorting network composed from columns of basic comparators.

Sorting networks consist of comparators, which are in fact hardware implementations of the CE operation. A comparator has two inputs and two outputs. Depending on the sense of ordering, comparators can be of two kinds, see Figure (a),(b). By default, we will assume case (a). A sorting network is an end-to-end network composed of columns of comparators (see Figure (c)). It is similar to a multistage interconnection network, except that the building block is a comparator rather than a 2× 2 switch. Unsorted input sequence X=x1,...,xN placed on input wires of the leftmost column of comparators passes through the network so that it becomes sorted output sequence (ascending, descending, bitonic) Y=y1,...,yN on output wires on the rightmost column.

The way how columns of comparators are interconnected determines the sorting algorithm. We will assume static networks with fixed interconnection of comparators. Then, each sorting network is a HW implementation of a sorting algorithm whose behavior is data-insensitive: the order in which the CE operations are applied to pairs of input elements does not depend on their input values. The sorting algorithm induced by the sorting network is called oblivious. Hence, an oblivious sorting algorithm performs the same steps for randomly generated input data as for completely sorted data. Terms ``oblivious sorting algorithm'' and ``sorting network'' can be considered synonymous.

The number of parallel CE steps of an oblivious sorting algorithm is given by the depth of its sorting network. Input wires of a network have depth 0. If a comparator in the network has input wires of depth d1 and d2, then its output wires have depth max(d1,d2)+1. The depth of a sorting network is the maximum over the depths of output wires.

If N is greater than the number K of input wires of the sorting network, then we assume that the input data set is split into N/K subsequences, which are then sorted. Comparators are equipped with buffers for N/K numbers and instead of CE operation, perform Compare-and-Split (CS) operation: two input sequences are merged and split again to lower and upper half. This takes O(N/K) steps.

Direct sorting

Given a network of p=N processing nodes, each with one input number. We can impose a total ordering on nodes by numbering them P1,...,PN. Then the sorting problem is equivalent to finding a unique permutation of input data such that Pi holds smaller number than Pi+1, using only exchanges of numbers between logically adjacent nodes (see Figure ). This kind of sorting is called direct network sorting.

CAPTION: The effect of a compare-and-exchange operation in direct networks, if x > y.

While designing a direct sorting algorithm, we must address the following 3 issues:

Oblivious sorting and 0-1 lemma

Oblivious sorting algorithms are easy to design and analyze due to so called 0-1 Sorting Lemma. It says that if a sorting network works correctly for any input sequence of 0's and 1's, then it works correctly on any input taken from any linearly ordered set, for example integers or reals. It is extremely useful for proving correctness and/or complexity of oblivious sorting algorithms. To prove it, we need an auxiliary result.

Lemma

Let f be a monotonically increasing function on a linearly ordered set S, i.e.,
for all  x,y in  S; x<= y iff f(y)<= f(y).
If a comparator network transforms an input sequence X=x1,...,xn of elements from S into an output sequence Y=y1,...,yn, then it transforms the input sequence f(X)=f(x1),...,f(xn) into the output sequence f(Y)=f(y1),...,f(yn).
Proof
(By induction on the depth of the network.)
  1. A single comparator is oblivious to f.

    CAPTION: Proof that one comparator is oblivious to any monotonically increasing function f.

    Hence, we have shown that

    for x,y in S, a comparator exchanges f(x) and f(y) iff it exchanges x and y.
  2. Induction step. Using induction on the depth of the network, we can prove even a stronger statement than is the lemma:
    If a wire in the network carries value xi when the input sequence is X, then it carries value f(xi) when the input is f(X).

Lemma

(The 0-1 Sorting Lemma.) If a comparison network with N inputs sorts all 2N possible sequences of 0's and 1's correctly, then it sorts all input sequences of arbitrary numbers correctly.
Proof
(By contradiction.)

CAPTION: A failure to sort X leads to a failure to sort a binary sequence.

Back to the beginning of the page Back to the CS838 class schedule

Mesh-based oblivious sorting

Even-odd transposition sorting on 1-D meshes

This algorithm is also called parallel bubble-sort. It takes precisely N steps to sort N numbers on M(N), which is optimal due to the diameter argument. The cost of the algorithm is therefore O(N2). The idea is obvious: alternate CE operations between odd-even and even-odd adjacent pairs of nodes. Let x1,..,xN be a sequence of numbers to be sorted rightward in ascending order.

for j=1...[ N/2] do_sequentially
     begin
       for i=1,3,..,2[ N/2]-1 do_in_parallel
           if xi > xi+1 then CE(xi,xi+1);
       for i=2,4,..,2[ (N-1)/2] do_in_parallel
           if xi > xi+1 then CE(xi,xi+1);
  end

Lemma

Even-odd transposition algorithm sorts N numbers on M(N) in N steps.
Proof
Using 0-1 Sorting Lemma, assume an input sequence of k 1's and N-k 0's, for any 1<= k<= N-1, arbitrarily scrambled. By induction on k, we can show that the algorithm will move the k 1's into positions N-k+1,..,N within N steps. Consider an even-odd transposition sort of N > p numbers on p processors. Since p nodes must perform sequentially p CS operations with N/p numbers and one such operation takes \Theta(N/p) CE operations in each processor and communication latency of an exchange of N/p numbers between adjacent nodes is also \Theta(N/p), the total parallel time is
T(N,p)=\Theta((N/p)log(N/p))+\Theta(N).
Hence, The scalability of even-odd transposition sort is therefore very poor, since to keep a constant efficiency, input data size must grow exponentially with the number of processors.

ShearSort on 2-D meshes

The simplest 2-D mesh sorting algorithm is called ShearSort and it is based on the even-odd transposition. It works for any 2-D mesh. Consider M(n,m), n rows and m columns.

for i=1,..., 2log n+1 do_sequentially
  if (i is odd) 
        then SORT_ALL_ROWS_SNAKELIKE ( Figure, Phase 1,3,5)
        else SORT_ALL_COLUMNS ( Figure, Phase 2,4)

CAPTION: Five phases of ShearSort on M(4,4).

Shearsort on M(n,m) consists of 2log n+1 phases. In odd-numbered phases, individual rows are sorted alternately rightward and leftward. In even-numbered phases, individual columns are sorted downward. In both cases, we can use even-odd transposition sorting. Figure shows five phases of ShearSort on M(4,4).

Lemma

ShearSort needs [log n]+1 row phases and [log n] column phases to sort nm numbers on M(n,m) in a snakelike order.One row (column) phase takes \Theta(n) (\Theta(m), respectively) computational and communication steps.
Proof
(By 0-1 Sorting Lemma.) It is noteworthy that if the rows were sorted in a row-major order instead of in a snakelike order, then the algorithm will not sort. Unfortunately, ShearSort is not optimal. Improvements based on decreasing number of dirty rows help only a little. In Section #16, we will describe an asymptotically optimal 2-D mesh sorting algorithm.

Lower bound on 2-D mesh snakelike sorting

The trivial lower bound on parallel sorting on a 2-D mesh M(n,n) is 2n-2, the diameter of the mesh. If the snakelike order is required, the lower bound on the number of CE steps for oblivious sorting on a 2-D mesh is greater than the trivial diameter-based one.

Lemma

Any oblivious snakelike sorting on mesh M(n,n) requires at least 3n-2\sqrt(n)-4 CE steps.
Proof
(Constructive.) Let N=n2. Assume any snakelike-order sorting algorithm A.

Lexicographic sorting on 3-D meshes

Consider the problem of sorting n3 numbers on M(n,n,n) in lexicographic zyx-order. Surprisingly, there is an algorithm which needs just five (!) phases, where one phase sorts numbers within 2-D planes. It works as follows.

Phase 1.  sort all xz-planes in zx-order;
Phase 2.  sort all yz-planes in zy-order;
Phase 3.  sort all xy-planes in yx-order snakelike (in direction y);
Phase 4.  perform one odd-even and one even-odd transposition within all columns in parallel;
Phase 5.  sort all xy-planes in yx-order.

Proof
(by 0-1 Sorting Lemma) Consider any input sequence of 0's and 1's.

CAPTION: 3-D mesh lexicographic sorting.

Back to the beginning of the page Back to the CS838 class schedule