Section#13: Alltoall broadcast algorithms in SF networks
(CS838: Topics in parallel computing, CS1221, Tue, Mar 16, 1999, 8:009:15 a.m.)
The contents
 SF switching with packet combining
 Allport fullduplex model
 Allport halfduplex model
 1port fullduplex model
 1port halfduplex model
 SF switching with no packet combining
 Allport fullduplex model
 Allport halfduplex model
 1port fullduplex model
 1port halfduplex model
In alltoall broadcast, also called gossip or total exchange, every node sends the same packet to every other node. It is used as a subroutine in many important parallel algorithms.
The complexity of AAB depends strongly on parameters of the communication subsystem and on the packet size. If packets are small, then the number of rounds should be minimized, using combining of packets. If packets are large, then a SF noncombining approach where packets are pipelined separately across the network is the best. In this section, we will explain the AAB algorithms with SF switching in both combining and noncombining cases.
SF switching with packet combining
Allport fullduplex model
In this most powerful and least realistic model, the AAB can be performed optimally using again a kind of trivial flooding (or greedy) approach. The number of rounds to complete an AAB on any network G is diam(G). In the first round, every node sends its broadcast packet to all its neighbors and in further rounds, it simply keeps sending all the new packets he received to all its neighbors. This protocol is clearly inefficient and produces a lot of redundant packets sent around the network. The redundancy can be decreased by constraining some parameters.
Allport halfduplex model
Since halfduplex channels can simulate fullduplex channels with slowdown two, the trivial upper bound on the number of rounds with halfduplex channels is 2diam(G).
A trivial AAB scheme under these conditions is called gatherscatter. One node accumulates all the packets and then broadcasts the accumulated information in one aggregate packet using OAB. The number of rounds is bounded by the complexity of OAB.
But in many cases, we can do better.
Lemma
Assume allport halfduplex combining model. If G is bipartite with diam(G)=D, then t_{AAB}(G)<= D+1.
Proof
Since G is bipartite, it has a vertex 2coloring. The AAB algorithm alternates oddnumbered rounds in which white nodes send information to black ones, with evennumbered rounds when the communication goes in the opposite way. Since black nodes start to disseminate the information in the second round, diam(G)+1 rounds are needed in the worst case.
CAPTION: AAB in a bipartite allport halfduplex network.
Figure shows oddnumbered and evennumbered round in a bipartite network.
Similarly, orientation of edges of G may help. If G is an undirected connected graph and we assign orientation to every edge, we get a digraph \vec{G} which had an oriented diameter D'. If there is an oriented path from x to y for any two nodes x and y, then \vec{G} is strongly oriented and the oriented diameter is finite.
Lemma
Assume allport halfduplex combining model. If diam(G)=D and diam(\vec{G})=D', then
D<= t_{AAB}(G)<= min(2D,D')
Proof
The lower bound is always the diameter. The upperbound 2D follows from the simulation of fullduplex channels by halfduplex ones with slowdown 2. But if AAB uses channels only in the direction implied by the orientation of \vec{G}, i.e., nodes just keep sending all the information they got form input channels to all output channels, then the number of rounds is D'.
The problem of enforcing orientation on edges of a graph so that the oriented diameter is minimum is hard in general. It has been solved for 2D meshes and tori, for some higherdimensional meshes and tori, and for hypercubes. All these topologies have orientation with diameter equal or nearly equal to the unoriented diameter. In case of 2D meshes, it is also called the Manhattan problem: Make the streets and avenues in Manhattan oneway so that the distances remain the same as before. The constructional proof of the Manhattan problem is very long and technical. The problem is trivial for de Bruijn and Kautz digraphs.
Here, we will show only the proof for the hypercube. The orientation is again constructed inductively. It follows from a more general construction.
Lemma
If a bipartite graph G has an orientation of diameter <= k, k>= 3, such that every vertex is in a cycle of length <= k, then graph G× K_{2}, the cartesian product of G and the complete graph on two vertices, has an orientation of diameter <= k+1 such that every vertex is in a cycle of length <= k.
Proof
Constructive. Take two copies of G, G_{1} and G_{2}, with 2coloring of vertices, assume black and white, and make G× K_{2} by joining every vertex in G_{1} with its image in G_{2}. The orientation of edges of this perfect matching is
from G_{1} to G_{2} iff the incident vertices are black ( Figure (a)).
Cartesian product preserves the length of oriented cycles in both copies, so that the induction hypothesis is preserved. We will distinguish only four cases, the other ones are symmetric.
 By induction, there is a path of length at most k+1 from any black vertex u in G_{1} to any vertex v in G_{2} (go to u', the image of u in G_{2}, and use induction to get from u' to v) ( Figure (b)).
 By induction, there is a path of length at most k+1 from any black vertex u in G_{2} to any white vertex v in G_{1} (use induction to get to v', the image of v in G_{2}, and go to G_{1}) ( Figure (c)).
 There is a path of length at most k+1 from any black vertex u in G_{2} to any black v vertex in G_{1}.
 If v is the image of u, then the path from u to its successor w in a cycle C\subset G_{2} of length <= k, to the image of w in G_{1}, and then to v using the image of C in G_{1} has length 2+(k1) ( Figure (d)).
 If v is not the image of u, let v' be its image in G_{2}. Then there is a oriented path from u to v' of length at most k in G_{2}. Let w be the vertex adjacent to v' on this path and let w' be its image in G_{1}. Then the path from u to v via w and w' has the length <= (k1)+2 ( Figure (e)).
To make the graph G× K_{2} bipartite, just flip the colors in one of its halves.
CAPTION: Induction step of orientation of graph G× K_{2}.
The result for the hypercube follows from the fact that diam(\vec{Q}_{2})=3, diam(\vec{Q}_{3})=5, and diam(\vec{Q}_{4})=4 (see Figure ).
Corollary
For n>= 4, diam(\vec{Q}_{n})=n.
CAPTION: The oriented diameter of Q_{4} is 4.
Also here, we can apply 1port gather/scatter. Since in the combining model, the complexity of the gather operation is the bounded by the complexity of the broadcast, we have
Lemma
For any Nnode graph G,
max(diam(G),[log N])<= t_{OAB}(G)<= t_{AAB}(G)<= 2t_{OAB}(G)
Note that we do not need fullduplex channels for this approach. However, for some topologies, there are much faster AAB algorithms, close to the lower bound max(diam(G),[log N]).
Lemma
Assume fullduplex 1port combining model. Then the number of rounds of AAB on meshbased topologies is the following.
 t_{AAB}(M(z))=
{z1 if z is even,
z otherwise.
}
 t_{AAB}(T(z))=
{z/2 if z is even,
(z+3)/2 otherwise.
}
 t_{AAB}(M(z_{1},...,z_{n}))=\sum_{i=1}^{n} t_{AAB}(M(z_{i})), t_{AAB}(T(z_{1},...,z_{n}))=\sum_{i=1}^{n} t_{AAB}(T(z_{i})).
 t_{AAB}(Q_{n})=n.
Proof
 Consider linear array M(z) of nodes i, i=1,...,z. The algorithm alternates oddnumbered rounds and evennumbered rounds, starting from round 1.
 In oddnumbered rounds, the exchange of information is between all oddeven pairs of nodes, i.e., between 1 and 2, 3 and 4, and so on.
 In evennumbered rounds, the exchange of information is between all evenodd pairs of nodes, i.e., between 2 and 3, 4 and 5, and so on.
The result is by a simple case analysis. See Figure for an example.
CAPTION: AAB on a fullduplex 1port combining linear array.
 1D tori T(z) are treated exactly the same way.
 z is even. Then these oddeven and evenodd pairs form always a perfect matching. Every packet proceeds by one hop in both directions in each round from its original source. The number of rounds equals the diameter.
 z is odd. Then always one node is out of game. This requires two more rounds than the diameter is. Figure gives the idea of an optimal schedule..
CAPTION: AAB on a 1port fullduplex cycle of length 5 in 4 rounds.
 AAB on higherdimensional 1port meshes and tori can be performed by using the cartesianproduct property. All the nodes exchange information along dimension 1 so that each of them knows the accumulated information for the whole linear array/cycle corresponding to dimension 1. Then the exchange proceeds along second dimension so that nodes belonging to the same 2D submesh/subtorus are mutually informed. And so forth. Since the overhead with combining the information together is ignored, the total number of rounds is just a plain sum of rounds of these phases.
CAPTION: Dimensionordered AAB in 1port fullduplex 2D tori.
 The algorithm is in fact a special case of that for meshes/tori. In round i, i=0,...,n1, the communication proceeds along all channels in dimension i which induces n edgedisjoint perfect matchings. 1port hypercube algorithm is optimal, the number of rounds matches the diameter, and allport assumption does not improve anything.
CAPTION: 3round AAB in 1port combining hypercubes. The node labels correspond to the information so far received by that node.
The size of messages doubles each round and so
t_{AAB}(Q_{n})=\sum_{i=0}^{n1}(t_{s}+2^{i}Mt_{m})=t_{s}n+Mt_{m}(2^{n}1)
The first term is a product of the diameter and startup time, the second term is the product of the number of nodes and onehop interrouter latency.
1port halfduplex model
The AAB algorithms for halfduplex meshbased topologies are very similar to those for meshes with fullduplex channels.
The best thing to do for linear arrays is to simulate fullduplex channels by alternating the direction of communication on halfduplex channels, the number of rounds just doubles ( Figure ).
For 1D torus T(z), we may use a similar approach, but surprisingly, this does not give optimal results. There is a better solution which is embarrassingly trivial:
pipeline all packets around the cycle in one direction.
The number of rounds is always z1, for both even and odd z. All the channels are used in all rounds which may intuitively explain better efficiency compared to simulation of fullduplex exchanges.
Algorithms for higherdimensional meshes/tori and hypercubes can be again derived from these basic ones using cartesianproduct property, similar to the approach shown on Figure . Consider 2D torus T(z,z) whose every node holds initially a packet of size M. In the first phase, each node collects packets within its horizontal cycle and ends up with a packet of size zM. In the second phase, every node participates in an AAB within its vertical cycle. Since all the communication is neighborneighbor, we may omit the internode latency t_{d} and the total time becomes
t_{AAB}(T(z,z))=(t_{s}+mt_{m})(z1)+(t_{s}+zmt_{m})(z1)=2t_{s}(z1)+mt_{m}(z^{2}1)
This trivial solution can be generalized to arbitrary meshes or tori, but it is not optimal. Optimal solutions are known, however, they are more complicated. Figure gives just an idea of an optimal strategy for 2D tori.
CAPTION: The basic pattern for AAB in a SF 1port halfduplex 2D tori, based on repeating matching patterns 0, 1, 2, and 3.
Allport fullduplex model
Each of N input packets must be delivered individually to all nodes. The number of packets moving in parallel across the network is much higher than in the combining model. Fullduplex channels are pairs of inverse simplex links, called arcs. Every node u of network G is the root of a broadcast spanning tree B(u). Each arch of B(u) is labelled with the number of the round in which the broadcast packet passes this arc. An arc with label i is said to be of level i. The height of B(u)), h(B(u)), is the highest arc label in B(u). Two trees B(u) and B(v) are said to be timearcdisjoint if for any i in {0,...,min(h(B(u)),h(B(v)))1}, the sets of arcs at level i in B(u) and B(v) are disjoint.
The motivation behind these definitions follows from the lockstep assumption: all the nodes start their broadcasts at the same time and the broadcasts proceed synchronously along the trees with same speed. At round i, only the arcs at level i are active in all the trees. If broadcast trees are timearcdisjoint, then all these parallel broadcasts are pairwise linkcontentionfree.
The lower bound on the number of rounds in network G with N nodes and minimum degree d is [(N1)/d]. Hence, it is [(z_{1}z_{2}1)/2] for M(z_{1},z_{2}), [(z_{1}z_{2}1)/4] for T(z_{1},z_{2}), [(2^{n}1)/n] for Q_{n}, and so on.
Under these assumptions, the problem of designing an optimal AAB algorithm reduces to the problem of designing TADBTs rooted in every node of the network! Here, we can nicely demonstrate how important is the symmetry of the underlying topology for solving such a problem. If the topology is vertexsymmetric, we may have a chance to find a generic structure of a TADBT, independent of the location of its root. This problem has been solved for 2D and 3D tori, 2D meshes, hypercubes, and some other Cayley graphs.
2D tori
2D tori are vertexsymmetric. The broadcast trees can be made isomorphic to each other, being translations of a generic tree, denoted B(*). The translation from node u to v of T(z_{1},z_{2}) is an automorphism pi_{u> v} of T(z_{1},z_{2}) defined by
pi_{u> v}: x> x+ v u
where + and  denote coordinatewise addition and subtraction modulo z_{1} and z_{2}. The translation preserves all the arc labels (i.e., arc time levels). It is easy to show that
Lemma
The broadcast trees made by translations of B(*) are pairwise TADBTs iff all the arcs at each level i in B(*) are of distinct directions.
In T(z_{1},z_{2}) there exist four directions N,E,W,S. It follows that B(*) provides an optimal AAB in T(z_{1},z_{2}) iff
 each level of B(*) consists of at most 4 arcs of distinct directions from the set {N,E,W,S}.
 h(B(*))=[(z_{1}z_{2}1)/u]
The structure of such a tree is trivial if z_{1}=z_{2} is odd. The B(*) consists of 4 snakes of size (z_{1}z_{2}1)/4, which are rotations of the same pattern and which cover completely the whole torus. Figure shows two solutions.
The AAB in such a torus is also memory optimal. Routers need no auxiliary buffers for storing packets: every packet is stored into the localnode memory and forwarded to one of output channels.
CAPTION: Two different TADBTs in T(7,7). The wraparound torus edges are omitted.
Time and memoryoptimal solutions exist for any 2D and 3D torus. in 2D case, a general T(z_{1},z_{2}) requires \beta<= 3, but in many cases, there is even a bufferless timeoptimal AAB, Figure shows a simple example.
CAPTION: An example of generic broadcast tree for a bufferless AAB in allport noncombining T(3,4).
A trivial solution exists for AAB in 3D cube T(z,z,z) if z is odd. The cube is partitioned into 6 pyramids, covered by isomorphic snakes. In general 3D T(z_{1},z_{2},z_{3}) requires \beta<= 60.
Hypercube
The same approach can be used in hypercubes, since Lemma can be easily generalized to any vertexsymmetric meshbased topology. We need to build a generic broadcast tree B(*) of Q_{n} such that the set of arcs at level i consists of n hypercube arcs of n distinct directions, except for the last level m=[(2^{n}1)/n] which consists always of less than n arcs (due to small Fermat theorem, 2^{n}1 is not divisible by n). Then the number of rounds will match the lower bound.
Similarly to the construction of the OAS tree in Section #12, without loss of generality, we can place the root of the generic tree into node 0^{n} and the construction of B(0^{n}) is based on partitioning the nodes of Q_{n} into necklaces and on arranging the nodes into an ncolumn table. In contrast to the OAS tree where each column consisted exclusively of nodes of one subtree of the OAS tree, here the requirement of timearcdisjointedness implies that
 row i of the table defines the tree nodes at level i (i.e., informed in round i),
 if node u=u_{n1}... u_{0} appears in column j, j=0,...,n1, then u_{j}=1,
 the AAB tree is constructed by connecting every node u=u_{n1}... u_{j+1}1u_{j1}... u_{0} in column j to node \non_{j}(u)=u_{n1}... u_{j+1}0u_{j1}... u_{0} (which is never in the same column as u).
The last condition implies that all input arcs of nodes on a given level have distinct directions. To make this possible, the arrangement of nodes into such an ncolumn table must guarantee that
 if u is in column j, j=0,...,n1, then node \non_{j}(u) appears at an earlier level (= row) than node u.
The construction is the following:
 All nonzero nodes are partitioned into equivalence classes R_{k}, 1<= k<= n1, consisting of nodes with k 1's, listed in ascending order of the number of 1's in their nodes: R_{1},...,R_{n1},R_{n}.
 Each R_{k} is split into necklaces L_{kj}, closed with respect to left rotation. Depending on the periodicity of nodes, each necklace consists of q nodes, where q=n (full necklaces) or q < n is a divisor of n (degenerate necklace).
 Necklaces L_{kj} are ordered within R_{k} so that the first necklace L_{k1} is always the the full necklace containing node 1^{k}0^{nk}. The remaining necklaces in each R_{k} can be taken in any order.
 Starting with L_{11}, we take necklaces in the order above and fill with their nodes the ncolumn table rowwise lefttoright so that each necklace which fills the table from column j is always rotated so as to start with a node u which has u_{j}=1. It can be any such node of the necklace except for necklaces L_{k1}, where the first node must have at position j1 (cyclically) zero bit. This condition is sufficient for meeting the requirement that node \non_{j}(u) is at a higher row in the table than node u, where j is the column of u.
Table shows one such partition of V(Q_{6}) into a 6column table. For example, the first necklace of R_{4}, L_{41}, starts in column 5 and therefore it must start with node u such that u_{5}=1 and u_{4}=0. The only node in this necklace with this property is node u=100111.

 {0}  {1}  {2}  {3}  {4}  {5}


1  L_{11}{:} {000001  000010  000100  001000  010000  100000}

2  L_{21}{:} {000011  000110  001100  011000  110000  100001}

3  L_{22}{:} {010001  100010  000101  001010  010100  101000}

4  L_{23}{:} {001001  010010  100100}  L_{31}{:} {111000  110001  100011

5  000111  001110  011100}  L_{32}{:} {101100  011001  110010

6  100101  001011  010110}  L_{33}{:} {001101  011010  110100

7  101001  010011  100110}  L_{34}{:} {101010  010101}  L_{41}{:} {100111

8  001111  011110  111100  111001  110011}  L_{42}{:} {111010

9  110101  101011  010111  101110  011101}  L_{43}{:} {110110

10  101101  011011}  L_{51}{:} {111101  111011  110111  101111

11  011111  111110}  L_{61}{:} {111111}   



CAPTION: Partition of V(Q_{6}) into a 6column 11row table. Every node u in row k and column j can be connected to a node \non_{j}(u) which always appears in row l, l < k.
Figure shows in detail the structure of the corresponding broadcast tree B(0^{6}) of height 11.
CAPTION: A timearcdisjoint spanning tree of Q_{6}.
Timeindependent gossip in symmetric networks
The AAB algorithms based on TADBTs enforce timedependent behavior on nodes:
the routing function depends on the round number. If the topology is node and edgesymmetric and the degree of nodes is even, say 2d, then we can decompose the arc set into d arcdisjoint hamiltonian cycles. This allows a fixed routing function in every node during the whole AAB. The advantage of this solution is that it is always bufferless and it was generalized to any hamiltonian higherdimensional torus, the disadvantage is that it cannot be timeoptimal if every node has initially only one packet.
Lemma
Consider a network G with degree 4. A timeindependent AAB algorithm on 2 edgedisjoint hamiltonian cycles with one packet per node requires at least [11N/401] rounds.
On the other hand, AAB on a 2D torus where every node initially holds 2 packets becomes timeoptimal. The solution is easy in case of evensized tori.
Lemma
If every node in T(z_{1},z_{2}) with z_{1},z_{2} even holds initially 2 packets, then there is a timeindependent AAB in z_{1}z_{2}/2 rounds.
Proof
Every node just sends packet 1 (2) to both directions along hamiltonian cycle 1 (2), respectively. Every packet makes z_{1}z_{2}/2 hops to reach all other processors. Every node receives 4 packets in every round and stores them into the local node memory. If c(u) denotes the column number of node u, then node u implements the following routing function for forwarding packets from input to output channels:
 N <> E and S <> W, if c(u)=z_{2}1 or c(u) is even,
 N <> W and S <> E otherwise (see Figure ).
Only in the last round, each node receives twice the same packet.
CAPTION: Two edgedisjoint hamiltonian cycles in T(4,6).
Again, we can simulate fullduplex AABs by doubling the number of step, which however in most cases does not provide optimal solutions. Here we will show an asymptotically timeoptimal AAB algorithm for 2D meshes which unfortunately is not memoryoptimal, it requires \beta=\sqrt{N} for an Nnode mesh. It consists of two phases. Let us define it for simplicity only for square meshes M(z,z) where z is even, since generalization to other 2D meshes is obvious.
 Apply 2coloring to partition nodes of T(z,z) into blue and green ones (see Figure ).
 Every blue (green) node holds a blue (green) packet.
 Phase 1: Blue nodes perform AABs of blue packets within rows and simultaneously green nodes perform AABs of green packets within columns. Blue nodes only store green packets and green nodes only store blue packets.
 The AABs take just z1 rounds, since there is no contention on halfduplex links.
 At the end of Phase 1, every node holds all z/2 blue packets from its row and all z/2 green packets from its column (see Figure ).
 Phase 2: All nodes perform row AABs of green packets and simultaneously column AABs of blue packets. These AABs consist of two parts:
 the first part is a trivial halfduplex simulation of fullduplex allport exchanges until the latest packet moving leftward passes the latest packet moving right in the middle of each row or column. If z is even, the fullduplex scheme would take z/2 steps, each consisting of z/2 rounds, so this halfduplex simulation takes z^{2}/2 rounds (see Figure (c)).
 the second part is just a straightforward pipelining of both packet waves towards the borders, which takes 2(z/21) rounds (see Figure (d)).
Phase 2 takes therefore z^{2}/2+n2 rounds.
The lower bound on the number of rounds is the ratio of z^{2}(z^{2}1), the total number of hops that must be performed, and 2(z^{2}z), the maximum number of hops per one round which M(z,z) can provide, i.e., (z^{2}+z)/2. This algorithm is therefore asymptotically optimal.
CAPTION: Allport halfduplex AAB in noncombining M(4,4) (a) Initial 2coloring. (b) Distribution of packets after Phase 1. (c) The first two steps of Phase 2 in the first row of the mesh. (d) The first row after the first part of Phase 2.
In this model, each node can only exchange exactly one packet with one of its neighbors. Several lower bounds apply here.
Lemma
For Nnode network G, the lower bound on the number of rounds of AAB, g(G), is
[ {N(N1)/ 2\mu(G)}]
where \mu(G) is the size of maximum matching in G.
Since for any Nnode graph G, \mu(G)<=[ N/2], we have
Corollary
g(G)>= N1.
Any hamiltonian graph can achieve this lower bound by alternating evenodd and oddeven pair exchanges between nodes along its hamiltonian cycle. It is similar solution to that in Figure .
Lemma
For Nnode network G=(V,E), let X\subset V be a vertex cutset of G whose removal disconnects G into the q connected components V_{i}. The the lower bound on the number of rounds of AAB is
[ \sum_{i=1}^{q}{max(V_{i},NV_{i})/M_{X}}],
where M_{X} is the maximum matching in joining X with V_{X}.
Proof
For each i=1,...,q, X must intermediate at least NV_{i} packets to inform V_{i} from VV_{i}, and inversely, X must intermediate at least V_{i} packets to inform VV_{i} from V_{i}, so at least max(V_{i},NV_{i}) packet must be relayed between V_{i} and X. Since X is shared by all components, at least \sum_{i=1}^{q}max(V_{i},NV_{i}) packet transmissions are needed between nodes of X and nodes of V_{X}=\cup_{i} V_{i}.
CAPTION: Lower bound on the number of rounds in 1port fullduplex noncombining AAB
Optimal or nearly optimal AAB algorithms are known for this weakest model, where every node can either send or receive one packet in one round. Since most parallel machines have more powerful communication capabilities and the optimal or nearly optimal solutions known for meshbased topologies are fairly complicated, we skip this part here.