Asymptotically optimal mesh sorting algorithm

Section#16: Asymptotically optimal mesh sorting algorithm
(CS838: Topics in parallel computing, CS1221, Thu, Mar 25, 1999, 8:00-9:15 a.m.)

Let N=n². Consider the mesh M(n,n) of N nodes, each holding a number. Divide the mesh into n^{1/2} submeshes M(n^{3/4},n^{3/4}), called blocks for short. One column of n^{1/4} blocks is called a vertical slice and one row of n^{1/4} blocks a horizontal slice. Let m=n^{1/4} be the number of blocks in one slice (horizontal or vertical). Block (i,j) is the block in the i-th horizontal slice and the j-th vertical slice.
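The layout above is easiest to check on a small instance. The following is a minimal sketch, assuming n is an exact fourth power so that m=n^{1/4} is an integer; the function name mesh_parameters is hypothetical and not part of the lecture.

```python
def mesh_parameters(m):
    """Derive the block layout from m = n^(1/4), so that n = m**4 is exact."""
    n = m ** 4                  # side of the mesh M(n, n)
    block_side = m ** 3         # n^(3/4), side of one block
    return {
        "n": n,
        "N": n * n,             # total number of nodes
        "block_side": block_side,
        "blocks_per_slice": m,  # n^(1/4) blocks in each horizontal or vertical slice
        "total_blocks": m * m,  # n^(1/2) blocks in the whole mesh
    }

params = mesh_parameters(2)
print(params)
```

For m=2 this gives a 16×16 mesh cut into 4 blocks of 8×8 nodes, and the blocks cover the mesh exactly: 4·8² = 256 = N.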

The algorithm consists of 8 phases:

Phase 1: Sort snakelike individual blocks, all in parallel.
Phase 2: Permute the n^{3/4} columns of each vertical slice to distribute them uniformly among all m vertical slices. The i-th column of slice j, i=1,...,n^{3/4}, j=1,...,m, goes into vertical slice i mod m (a residue of 0 denoting slice m), into position (j-1)n^{1/2}+⌈i/m⌉ (see the figure).

CAPTION: Uniform distribution of columns among vertical slices. It can also be defined as a shuffle operation if the vertices are denoted by binary labels.
Phase 3: Repeat Phase 1, i.e., sort snakelike individual blocks, all in parallel.
Phase 4: Sort individual columns top-down, all in parallel.
Phase 5: Sort all odd-even vertical pairs of blocks in all vertical slices snakelike (a pair of blocks is treated as one block of double height).
Phase 6: Do the same with even-odd pairs of blocks.
Phase 7: Sort individual rows of the whole mesh according to a global snakelike ordering.
Phase 8: Perform 2n^{3/4} steps of odd-even transposition on this global snake.
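The column redistribution of Phase 2 can be sketched as an index mapping. This is only an illustration under stated assumptions: slices and positions are 1-based, the target slice is i mod m with a residue of 0 read as slice m, and the position inside the target slice is taken to be (j-1)n^{1/2}+⌈i/m⌉, which makes the mapping one-to-one. The function name phase2_target is hypothetical.

```python
def phase2_target(i, j, m):
    """Return (target_slice, target_position), both 1-based, for column i of slice j."""
    target_slice = (i - 1) % m + 1                     # i mod m, with 0 read as m
    # Assumed position: (j-1)*n^(1/2) + ceil(i/m), where n^(1/2) = m*m.
    target_position = (j - 1) * m * m + (i - 1) // m + 1
    return target_slice, target_position

m = 2
cols_per_slice = m ** 3                                # n^(3/4) columns in each slice
targets = [
    phase2_target(i, j, m)
    for j in range(1, m + 1)
    for i in range(1, cols_per_slice + 1)
]
# Every (slice, position) pair is hit exactly once, so this is a permutation.
print(len(set(targets)) == m * cols_per_slice)
```

Each source slice hands m² of its columns to every target slice, which is what spreads the dirty rows evenly.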

Lemma

The algorithm sorts N=n² numbers on mesh M(n,n) into a snakelike order in 3n+o(n) parallel steps.

Proof
(by 0-1 Sorting Lemma)

Every horizontal slice consists of m blocks and every vertical slice consists of m blocks.
After Phase 1, each block contains at most one dirty row, i.e., a row containing both 0's and 1's (see the figure).

CAPTION: The situation after Phase 1.
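The claim about a single dirty row can be checked on a small 0-1 block. The sketch below sorts the block sequentially rather than with a parallel mesh algorithm, so it only illustrates the resulting snakelike layout; the helper names snake_sort_block and dirty_rows are hypothetical.

```python
def snake_sort_block(block):
    """Sort a square block into snakelike order (sequentially, for illustration)."""
    k = len(block)
    flat = sorted(v for row in block for v in row)
    rows = [flat[r * k:(r + 1) * k] for r in range(k)]
    # Reverse every second row so that consecutive rows run in opposite directions.
    return [row if r % 2 == 0 else row[::-1] for r, row in enumerate(rows)]

def dirty_rows(block):
    """Count the rows that contain both 0's and 1's."""
    return sum(1 for row in block if 0 < sum(row) < len(row))

block = [[1, 0, 1, 0],
         [0, 0, 1, 1],
         [1, 0, 0, 0],
         [0, 1, 0, 0]]
sorted_block = snake_sort_block(block)
print(sorted_block, dirty_rows(sorted_block))
```

With 10 zeros and 6 ones, the first two rows are all 0's, the last row is all 1's, and only the third row is dirty.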
In Phase 2, the at most m dirty rows in each horizontal slice (one per block) are distributed uniformly among all blocks of the slice. In the worst case, the 1's of all these dirty rows start at the same position modulo m in each block, and so after the permutation one block can receive at most m more 1's than another. Hence, after Phase 2, the numbers of 1's in any two blocks within one horizontal slice differ by at most m.
It follows that the numbers of 1's in any two vertical slices differ by at most n^{1/2}=m², since each vertical slice consists of m blocks and each of them contributes a difference of at most m.

CAPTION: The situation after (a) Phase 3 and (b) Phase 4.
Since any two blocks within one horizontal slice can differ by at most m 1's and the length of one row in each block is n^{3/4}=m³, it follows that after Phase 3, each horizontal slice can have at most 2 dirty rows (see Figure (a)).
After Phase 4, each vertical slice can have at most m dirty rows, since it consists of m blocks, each with at most one dirty row (see Figure (b)).

CAPTION: The situation after (a) Phase 5 and (b) Phase 6.
Those m dirty rows can span the boundary between the two blocks of an odd-even or an even-odd pair of vertically adjacent blocks (as shown, for example, in Figure (b)); therefore we need both Phase 5 and Phase 6 (see the figure).
After Phase 6, each vertical slice can have at most one dirty row (see Figure (b)).
We know that any 2 vertical slices can differ by at most n^{1/2} 1's. Since the rows in vertical slices have length n^{3/4} > n^{1/2}, there can exist at most 2 global dirty rows in M(n,n), and the number of 1's in them can differ by at most n^{1/2}×m=n^{3/4}.
After Phase 7, if just one dirty row remains, we are done (as in our example).
If there are still 2 dirty rows, the upper one consists of 0's except perhaps for at most n^{3/4} 1's and the lower one consists of 1's except perhaps for at most n^{3/4} 0's.
Phase 8 then completes the sorting: the misplaced elements occupy a stretch of at most 2n^{3/4} consecutive positions on the global snake, which 2n^{3/4} steps of odd-even transposition suffice to sort.
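The final clean-up can be simulated on the two remaining dirty rows. This is a minimal sequential sketch, assuming n=16 (so 2n^{3/4}=16), an upper row that runs left to right and a lower row that runs right to left on the snake; the function name odd_even_transposition and the sample rows are illustrative only.

```python
def odd_even_transposition(seq, steps):
    """Run `steps` rounds of odd-even transposition on a linear array."""
    a = list(seq)
    for t in range(steps):
        # Alternate between pairs (0,1),(2,3),... and pairs (1,2),(3,4),...
        for p in range(t % 2, len(a) - 1, 2):
            if a[p] > a[p + 1]:
                a[p], a[p + 1] = a[p + 1], a[p]
    return a

budget = 2 * 8                    # 2 * n^(3/4) for n = 16
upper = [0] * 13 + [1] * 3        # mostly 0's, at most n^(3/4) 1's at the snake end
lower = [1] * 14 + [0] * 2        # stored left to right; the snake reads it right to left
snake = upper + lower[::-1]       # 0...0 1 1 1 0 0 1...1 along the global snake
result = odd_even_transposition(snake, budget)
print(result == sorted(snake))    # prints True
```

Only the short stretch of three 1's followed by two 0's is out of order, so far fewer than 16 rounds are actually needed here.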

Proof of the parallel time complexity: Each block is a mesh M(n^{3/4},n^{3/4}). If we use Shearsort within blocks and odd-even transposition in rows and columns, we have

Phase 1: T₁(N)=n^{3/4}(2 log n^{3/4}+1)=O(n^{3/4} log n)
Phase 2: T₂(N)=n
Phase 3: T₃(N)=T₁(N)
Phase 4: T₄(N)=n
Phase 5: T₅(N)=O(n^{3/4} log n)
Phase 6: T₆(N)=T₅(N)
Phase 7: T₇(N)=n
Phase 8: T₈(N)=2n^{3/4}

Hence the total parallel time complexity is \sum_{i=1}^{8} T_i(N)=3n+O(n^{3/4} log n)=3n+o(n).
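To get a feel for how slowly the lower-order term fades, the phase costs can be added up numerically. This is a rough sketch, not a precise cost model: logarithms are taken base 2, and the Phase 5 and 6 cost is an assumption (Shearsort on a block of double height, with log(2k)+1 row phases of k steps and log(2k) column phases of 2k steps, where k=n^{3/4}), since the text gives only the bound O(n^{3/4} log n). The function name phase_steps is hypothetical.

```python
import math

def phase_steps(m):
    """Estimated step counts per phase for n = m**4 (a sketch, not exact costs)."""
    n = m ** 4
    k = m ** 3                            # block side n^(3/4)
    t1 = k * (2 * math.log2(k) + 1)       # Shearsort on a k x k block
    # Assumption: Shearsort on a 2k x k block of double height.
    r = math.log2(2 * k)
    t5 = (r + 1) * k + r * 2 * k
    return {1: t1, 2: n, 3: t1, 4: n, 5: t5, 6: t5, 7: n, 8: 2 * k}

for m in (2, 4, 8, 16, 32):
    n = m ** 4
    total = sum(phase_steps(m).values())
    print(n, round(total / n, 3))         # ratio of the estimated total to n
```

Under these assumptions the ratio total/n decreases toward 3 as n grows, but only at a rate of roughly log n / n^{1/4}, so the 3n term dominates only for very large meshes.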
