In this section, we will treat hypercubic networks not only as direct networks, but also as multistage indirect networks (MIN), performing end-to-end permutations from the input nodes in the leftmost column to the output nodes in the rightmost column. The internal nodes are just switches. It is known that all the hypercubic MIN have equivalent end-to-end permutation routing properties. Without loss of generality, we may start with the ordinary butterfly to explain the main problems, and all the results mentioned here apply to all the hypercubic networks.
Back to the beginning of the page | Back to the CS838 class schedule |
Lemma
Assume greedy on-line oblivious deterministic routing in all-port SF ordinary butterfly oBFn. Then the number of communication steps for permutations bit-reversal and transpose is \Theta(\sqrt(N)).Proof
CAPTION: Bit-reversal permutation greedy routing on end-to-end butterfly oBF3. The middle-stage edges have congestion 2.
(0,u)-> (1,un-1... u1un-1)-> (2,un-1... u2un-2un-1)->...->
Let Uk=(k,un-1... uk+1ukuk+1... un-1), Uk+1=(k+1,un-1... uk+1ukuk+1... un-1), and ek=( Uk,Uk+1). We can easily observe that for given (k+1)-bit string un-1... uk, all packets sent from sources (0,un-1... uk *k) must pass through the channel corresponding to edge ek, since all these source nodes are leaves of CBTk with the root Uk. Let us call such a tree a left gather tree. The number of such packets is 2k=\sqrt(N/2), the first will cross ek in step k+1, and the last in step k+2k. This last packet then needs k steps to get into the rightmost column, so the lower bound on the number of steps of the routing algorithm is 2k+2k.
There are \sqrt(2N) left gather trees like this, each joined with the right half of the network via a single middle-stage edge, so that out of N middle-stage edges, only \sqrt(2N) are used for transferring N packets from the left to the right.
What are the memory requirements of an algorithm matching this lower bound? Assume \beta=O(2k). Consider any internal node in column i<= k of a left gather tree. It can accept 2 packets and send out one packet in one step, hence the largest accumulation of packets in its auxiliary buffers appears in step 2i-1 when all the routers in its left gather subtree has emptied their buffers, 2i-1 packets have gone, but 2i-1 are still stored. This holds for any i<= k, hence \beta <= 2k-1=\sqrt(N/8).
But it turns out that large buffers are not needed in fact, we can achieve the same lower-bound time with \beta=0. We just need a simple priority scheme to resolve the contention between two input packets over the single output channel to the parent router. For example, the packet with smaller source address has priority. If a packet moves up the tree, the whole chain of high priority packets from bellow moves one level up the tree. Consequently, the root of every left gather tree has in every step one packet to place on the middle-stage channel.
If n is even, the analysis is very similar and it gives nearly the same results --- the total number of steps is \sqrt(N)/2+n-1=\Theta(\sqrt(N)).
Hence, bit-reversal greedy permutation routing on a hypercubic topology performs equally badly with both large and constant buffers. The following lemma shows that the bit-reversal and the transpose permutations are (up to constant factors) examples of the worst-case permutations for greedy routing on hypercubic networks with nonconstant buffers.
Lemma
Let N=2n. Consider a routing problem in oBFn with at most one packet sent from each leftmost column node and at most one packet destined for each rightmost column node. If the channel buffer size is \Theta(\sqrt(N)), then any permutation can be routed in O(\sqrt(N)) steps using greedy algorithm.Proof
CAPTION: Worst-case time analysis of any permutation on oBFn.
Note that for small N=2n, say up to 128, there is almost no difference between \sqrt(N/2) and n=log N, so that problem becomes critical only for larger N. The worst-case permutations such as bit-reversal can be routed using precomputed instead of greedy routing (see the next subsection). Unfortunately, there are too many bad permutations for greedy routing so that we cannot precompute special routing for all of them.
Permutations with optimal greedy routing on oBFn
On the other hand, many useful permutations can be routed on a store-and-forward all-port oBFn in O(n) communication steps using greedy routing. We will show two important representants.
Packing routing problem A packing problem consists of routing packets from any subset M of input nodes into the first |M| output nodes so that the relative order of the packets does not change. So, the i-th packet within M is destined for output node i (see Figure ). Since initially input nodes from M do not know anything about the other nodes with packets, they must first learn its relative order within M in order to compute the destination address. They can compute this information using a parallel prefix sum algorithm which will be described in Section #21. They need 2n communication steps for that. Assume every active node in M knows its destination.
CAPTION: Contention-free routing for a packing problem on oBF3.
Lemma
The greedy routing provides contention-free paths for any packing problem.Proof
Spreading routing problem and reduction of routing to sorting Spreading is the reverse of packing, we just move packets from right to left. A spreading problem consists of routing M=|M| packets from a contiguous set of nodes in column n to any any subset M of nodes in column 0 so that the relative order of the packets does not change. Greedy routing will provide a contention-free shortest paths in the same way as in the packing (see Figure ). The most important application of spreading is its use in reducing a general permutation to sorting.
CAPTION: Contention-free routing for a spreading problem on oBF3.
Lemma
Any permutation can be routed on oBFn in time Ts(N)+O(n), where Ts(N) is the time to sort N items on oBFn.Proof
Monotone routing problem By combining one packing routing and one spreading routing, we can realize any monotone routing problem in O(n) steps on oBFn. A monotone routing is defined by giving an ordered set M1 of input nodes and ordered set M2 of output nodes, with |M1|=|M2|, such that the i-th node in M1 sends a packet to i-th node in M2. We perform packing of the packets from M1 by sweeping them left-to-right and then spreading by sweeping them right-to-left. If the packets are needed on the opposite side of oBFn, we transfer them using straight links. Figure shows monotone routing 0-> 2, 2-> 3,3->5,5->7 on oBF3.
CAPTION: A simple example of monotone routing as a composition of packing and spreading routing.
Monotone routing has a variety of applications, e.g., in load balancing of parallel computations.
Lemma
For any j, 0<= j<= N-1, the greedy routing provides node-disjoint paths for the translation permutation \pisj on oBFn, so that it is contention-free and takes n steps.Proof
CAPTION: Contention-free routing for translation by 3=011 on oBF3.
This argument becomes trivial in case of the complement permutation \pic defined on Page . Greedy routing guarantees the edge-disjointedness just because all the packets use only cross edges in all stages.
Permutation routing on binary hypercube
All the greedy permutation routing algorithms on SF all-port end-to-end oBFn are normal hypercube algorithms. Therefore, they can be executed on any other hypercubic network and on shuffle-exchange and de Bruijn graphs with only a constant slowdown. Any contention-free permutation routing on oBFn provides a congestion-free permutation routing on wormhole full-duplex all-port Qn!!! Figure shows the correspondence, compare with Figure .
CAPTION: Translation by 3=011 on full-duplex wormhole all-port Q3 using e-cube routing.
Back to the beginning of the page | Back to the CS838 class schedule |
CAPTION: 3-dimensional Benes network BN3.
Similarly to the ordinary butterfly, Benes network is hierarchically recursive. BNn contains two copies of BNn-1 as subgraphs, denoted as upper BNn-10 and lower BNn-11.
A key property of the Benes network is that it is rearrangeable complete network: for any permutation of inputs to outputs there is a contention-free routing. Let us state it formally in two lemmas.
Lemma
(Edge-disjoint routing.) Assume that each input/output node in BNn has two input/output channels, respectively. Let N=2n+1, N={0,...,N-1}, and let \pi:N->N be any permutation. Then in BNn, there is a set of N edge-disjoint paths joining input channels i to output channels \pi(i) for all iin\ Ns.
CAPTION: Edge-disjoint paths in BN3 for the transpose permutation routing.
Proof
By induction on n, based on the hierarchical recursivity of BNn.
Lemma
Node-disjoint routing. Let N=2n, N={0,...,N-1}, and let \pi:N->N be any permutation. Then in BN'n, there is a set of node-disjoint paths, joining every input channel i to output channel \pi(i) for all iin\ Ns.
CAPTION: Node-disjoint routing for a permutation on BN'3.
Proof
It is again an inductive proof similar to previous one, the only difference is in the definition of twins.
Lemma
Let N=n2n and N={0,...,N-1}. Consider all-port full-duplex direct wBFn with at most one packet per processor. Then for any permutation \pi:N->N there exists an off-line deterministic permutation routing for \pi on wBFn which needs at most 3n communication steps and only constant buffers.Proof
CAPTION: Last three phases of the off-line deterministic permutation routing on wBF.
This result has several important applications.
Lemma
Let N=n2n. Let G be arbitrary N-node all-port full-duplex SF interconnection network with maximum node degree d. Then the direct all-port full-duplex SF wBFn can simulate G with slowdown O(dlog N) if precomputation is allowed. .
CAPTION: Decomposition of a general communication step on an all-port full-duplex 4-node graph with maximum degree 3 into 3 permutations on wBF2.
Proof
Corollary
N-node all-port full-duplex wBFn can simulate N-node hypercube with slowdown O(log2N), if precomputation is allowed.Proof
Corollary
N-node hypercube can simulate any (Nlog N)-node bounded-degree network G with slowdown O(dlog N), d being the maximum node-degree of G, if precomutation is allowed.Proof
Back to the beginning of the page | Back to the CS838 class schedule |