Section #14: All-to-all scatter algorithms
(CS838: Topics in parallel computing, CS1221, Thu, Mar 18, 1999, 8:00-9:15 a.m.)
The contents
 Combining model with SF switching
  Hypercube
  1-D torus
  Higher-dimensional mesh/torus
 Combining model with WH switching
  Hypercube
  Meshes/tori
 Non-combining model with SF switching
  All-port full-duplex hypercube
 Non-combining model with WH switching
  Direct exchange AAS on hypercube
  Meshes/tori
In all-to-all scatter (AAS), also called complete exchange or all-to-all personalized communication, every node holds N-1 packets of size \mu, one specific packet for every other node. The total number of packets is therefore N(N-1), and the lower bound on the number of rounds is determined by the bisection width or by the overall communication throughput of the network.
A typical application which requires AAS is the transposition of a matrix A_{N,N} mapped row-wise onto an N-node distributed-memory machine. Each node i owns row i of A_{N,N}, and an element a_{i,j} of A_{N,N}, 0<= i,j<= N-1, is stored in i's local memory location j. It is easy to see that a parallel transposition of A_{N,N}, in which elements a_{i,j} and a_{j,i} exchange their positions in distributed memory, corresponds exactly to an AAS. Each node i has to exchange an element with every other node k, i\not=k.
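This correspondence is easy to check in a few lines. The following toy sketch (an illustration, not part of the original notes; the machine size N is a hypothetical small value) delivers every packet a_{i,j} to node j's memory location i and recovers the transpose.

```python
# Toy sketch: matrix transposition as an AAS (machine size N assumed).
N = 4

# Row-wise distribution: node i owns row i; a_{i,j} sits in location j.
A = [[10 * i + j for j in range(N)] for i in range(N)]

# AAS traffic: node i sends packet a_{i,j} to node j, for every j != i.
packets = [(i, j, A[i][j]) for i in range(N) for j in range(N) if i != j]

# Node j stores the packet received from node i in its location i;
# diagonal elements never move.
B = [[A[i][i] if i == j else None for j in range(N)] for i in range(N)]
for i, j, value in packets:
    B[j][i] = value

# The distributed result is exactly the transpose of A.
assert all(B[r][c] == A[c][r] for r in range(N) for c in range(N))
```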
Hypercube
An obvious algorithm for AAS, called the standard exchange (SEX), is based on recursively dividing the cube into half-subcubes. SEX in Q_{n} is just a sequence of n rounds, each implementing a perfect matching: in round i, i=1,...,n, every node takes all the 2^{n-1}=N/2 packets destined for the nodes in the opposite subcube of dimension n-i, permutes and combines them into one block, sends the block to its counterpart in the matching, reciprocally receives its counterpart's block, and adds the new packets to its pool of packets. The blocks of packets have the same size in all rounds. The communication time is
t_{AAS}(Q_{n},\mu)=n(t_{s} + \mu t_{m}2^{n-1}).
Figure shows the first two rounds of AAS on Q_{3}. Note that in both rounds, messages consist of 4 packets. In the first round, each node sends 4 distinct packets destined for the opposite 2-cube. In the second round, each node sends its own two packets and 2 packets from its first-round partner to the opposite 1-cube. In round 3, each node will send all 4 packets from its 2-cube destined exclusively for its partner in the last perfect matching.
CAPTION: The first 2 rounds of a 3-round AAS in a combining 1-port hypercube Q_{3}.
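The rounds described above can be checked with a small simulation. The sketch below (an illustration, not from the notes) tracks (source, destination) pairs through the n perfect matchings and verifies that every round exchanges blocks of exactly 2^{n-1} packets.

```python
# Simulation sketch of the standard exchange (SEX) AAS on Q_n.
n = 3
N = 1 << n

# pool[v] = set of (src, dst) packets currently held by node v.
pool = [{(v, d) for d in range(N) if d != v} for v in range(N)]

for r in range(n):
    bit = 1 << (n - 1 - r)          # dimension used in round r+1
    for v in range(N):
        if v & bit:
            continue                # handle each matched pair once
        w = v ^ bit
        # Each partner sends the packets destined for the opposite subcube.
        to_w = {p for p in pool[v] if (p[1] ^ v) & bit}
        to_v = {p for p in pool[w] if (p[1] ^ w) & bit}
        assert len(to_w) == len(to_v) == N // 2   # equal blocks every round
        pool[v] = (pool[v] - to_w) | to_v
        pool[w] = (pool[w] - to_v) | to_w

# After n rounds, every node holds exactly the packets addressed to it.
assert all(pool[v] == {(s, v) for s in range(N) if s != v} for v in range(N))
```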
1-D torus
1-port half-duplex links
A naive scheme, which is surprisingly optimal, is to pipeline all the messages cyclically around the ring. Each node combines its z-1 packets into one message of size (z-1)\mu, ordered so that the packet for its immediate successor comes first, and sends it to its right successor. In every round k, 1<= k<= z-1, each node thus sends a message of size (z-k)\mu to the right and receives one of the same size from the left; it extracts from the received message the one packet destined for itself and forwards the remaining part, of size (z-k-1)\mu, in the next round. The communication time is
t_{AAS}(T(z),\mu)=\sum_{i=1}^{z-1}(t_{s}+\mu t_{m}(z-i))=(t_{s} + \mu t_{m}z/2)(z-1).
It can be shown that this is optimal with respect to the communication bandwidth of T(z). The number of rounds equals the diameter of the oriented T(z). The total communication work is z times the work of one OAS, which is \mu t_{m}\sum_{i=1}^{z-1}i=z(z-1)\mu t_{m}/2, and this work must be carried by z edges. That is why wormhole routing requires the same number of rounds and cannot improve anything.
CAPTION: AAS on a 1-port half-duplex combining SF cycle T(8)
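The pipelined ring scheme can be sketched as follows (a simulation only; the ordering of packets inside the combined message is an assumption consistent with the peel-off rule above):

```python
# Sketch of the cyclic pipelined AAS on a half-duplex ring T(z).
z = 8

# Node v combines its z-1 packets, ordered so that the packet for its
# right neighbour can be peeled off first.
msg = {v: [(v, (v + k) % z) for k in range(1, z)] for v in range(z)}
delivered = {v: [] for v in range(z)}

for _ in range(z - 1):
    incoming = {(v + 1) % z: msg[v] for v in range(z)}  # everyone sends right
    msg = {}
    for v, m in incoming.items():
        head, rest = m[0], m[1:]
        assert head[1] == v       # the first packet is always for the receiver
        delivered[v].append(head) # extract own packet
        msg[v] = rest             # forward the shortened message next round

# Every node has received one packet from every other node.
assert all(sorted(delivered[v]) == sorted((s, v) for s in range(z) if s != v)
           for v in range(z))
```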
All-port full-duplex links
The all-port assumption allows the number of rounds and the total communication latency to be reduced to at least one half: messages are pipelined in both directions and are smaller. For example, for z odd,
t_{AAS}(T(z),\mu)=\sum_{i=1}^{(z-1)/2}(t_{s}+\mu t_{m}((z+1)/2-i))=t_{s}(z-1)/2 + \mu t_{m}(z^{2}-1)/8.
Higher-dimensional mesh/torus
Lower bound on the communication latency
The lower bound for higher-dimensional meshes can be derived similarly.
Lemma
Consider an n-dimensional torus T(z_{1},...,z_{n}) such that z_{i}>= z_{i+1} for all i. The lower bound on the number of rounds of AAS is
(1/2)\lceil z_{1}/2\rceil \lfloor z_{1}/2\rfloor \Pi_{i=2}^{n} z_{i}.
Proof
Cut T(z_{1},...,z_{n}) into T_{1}=T(\lceil z_{1}/2\rceil,z_{2},...,z_{n}) and T_{2}=T(\lfloor z_{1}/2\rfloor,z_{2},...,z_{n}). The number of full-duplex channels joining T_{1} with T_{2} is 2\Pi_{i=2}^{n} z_{i}. The amount of data that must be exchanged between T_{1} and T_{2}, in each direction, is V(T_{1})\cdot V(T_{2})=\lceil z_{1}/2\rceil \lfloor z_{1}/2\rfloor (\Pi_{i=2}^{n} z_{i})^{2}. Dividing the data volume by the number of channels gives the bound.
The usual approach is to apply a Cartesian product decomposition: perform AAS in one dimension in all corresponding submeshes/subtori in parallel, then in the second dimension, and so forth. For simplicity, consider only T(z,z). Each node must send z^{2}-1 packets in total. First, every node forms z-1 messages, one for each of the other columns, each containing the z packets destined for the nodes of that column, and combines them into one block of size z(z-1)\mu; these blocks are pipelined within all rows in parallel. Each receiver extracts the z packets destined for its column, stores them, and forwards the rest. After the row AAS has finished, each node holds packets from all z-1 colleagues in its row plus its own packets destined for its column. The column AAS therefore has exactly the same complexity and communication pattern as the row AAS.
t_{AAS}(T(z,z))=2\sum_{i=1}^{z-1}(t_{s}+\mu t_{m}z(z-i))=(2t_{s}+\mu t_{m}z^{2})(z-1).
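The two-phase product decomposition can be sketched as a toy simulation (timing and message combining are not modeled; the forwarding rule inside each ring is the pipelined scheme from the 1-D torus):

```python
# Sketch of the product-decomposition AAS on a torus T(z,z):
# a row AAS followed by a column AAS.
z = 4
nodes = [(r, c) for r in range(z) for c in range(z)]
# pool[v] = set of (src, dst) packets currently held by node v.
pool = {v: {(v, d) for d in nodes if d != v} for v in nodes}

def ring_aas(groups, key):
    # Pipelined AAS inside each ring: every round, each node forwards to
    # its successor all packets whose destination `key` is not its own,
    # and keeps (extracts) those that match.
    for _ in range(z - 1):
        moved = {}
        for g in groups:
            for idx, v in enumerate(g):
                fwd = {p for p in pool[v] if key(p[1]) != key(v)}
                pool[v] -= fwd
                moved.setdefault(g[(idx + 1) % z], set()).update(fwd)
        for v, ps in moved.items():
            pool[v] |= ps

rows = [[(r, c) for c in range(z)] for r in range(z)]
cols = [[(r, c) for r in range(z)] for c in range(z)]
ring_aas(rows, key=lambda d: d[1])   # phase 1: packets reach their column
ring_aas(cols, key=lambda d: d[0])   # phase 2: then their row
assert all(pool[v] == {(s, v) for s in nodes if s != v} for v in nodes)
```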
The algorithm for a 2-D mesh is analogous.
Hypercube
Even with WH switching, we can apply the SF scheme from Subsection . Whether it provides optimal performance depends on the ratio between t_{s} and \mu t_{m}. If t_{s}\approx \mu t_{m}, then the SF algorithm is optimal. Otherwise, distance-insensitive routing can reduce the communication latency. Let us prove that SF is not optimal when t_{s}<< \mu t_{m}.
In an SF hypercube Q_{n}, 2^{n}(2^{n}-1) packets of size \mu must be transmitted over distances from 1 to n, the average distance being n/2. Hence the total communication work is \mu t_{m}2^{n}(2^{n}-1)n/2. The total number of arcs of Q_{n} is n2^{n}, and therefore the lower bound on the communication latency is
(\mu t_{m}2^{n}(2^{n}-1)n/2)/(n2^{n})=\mu t_{m}(2^{n}-1)/2,
which is n times less than the latency of the SF perfect-matching algorithm of Subsection . There indeed exists an optimal AAS algorithm for WH hypercubes, but surprisingly, it does not use packet combining! We will introduce it in Subsection .
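The two quantities are easy to compare numerically. The sketch below (an illustration) evaluates the SF transmission term n\mu t_{m}2^{n-1} against the bound \mu t_{m}(2^{n}-1)/2, both in units of \mu t_{m} and ignoring t_{s} and t_{d}:

```python
# Transmission term of SF SEX vs. the WH bandwidth lower bound
# (both measured in units of mu * t_m).
def sf_transmission(n):
    return n * 2 ** (n - 1)        # n rounds, blocks of 2^(n-1) packets

def wh_lower_bound(n):
    return (2 ** n - 1) / 2        # total work divided by n*2^n arcs

for n in (3, 6, 10):
    ratio = sf_transmission(n) / wh_lower_bound(n)
    assert n < ratio < 1.2 * n     # SF sits roughly a factor n above the bound
```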
Meshes/tori
Similarly to one-to-all communication, a simple approach in low-dimensional WH meshes is to simulate AAS on SF hypercubes. Of course, it works best on meshes whose side lengths are powers of two.
Binary exchange AAS (BEX)
The idea of the first solution for a 2-D mesh, called binary exchange (BEX), is depicted in Figure . Mesh M(2^{k},2^{l}) is recursively halved, alternately in the X and Y dimensions. If k=l, the X and Y dimensions alternate till the end and the number of phases is k+l=\log N. One phase may require several rounds, due to WH channel contention. In each round, a block of N/2 packets is exchanged between partners, and the packets must then be reorganized into new blocks exactly as in the hypercube SEX algorithm.
CAPTION: Binary exchange AAS on M(8,8).
If N=4^{k} is the size of M(2^{k},2^{k}), then the communication latency is roughly
t_{AAS}(M(\sqrt{N},\sqrt{N}),\mu)= \sqrt{N}t_{s}+2Nt_{d}/3+N\sqrt{N}\mu t_{m}.
Quadrant exchange AAS (QEX)
Another algorithm was designed specifically for WH meshes.
 Mesh M(2^{k},2^{k}) is recursively split into quarters; one splitting corresponds to one phase.
 A phase corresponding to quarters of size 2^{l}× 2^{l} consists of 2^{l} steps.
 In step i, 1<= i<= 2^{l}, all elements on the i-th diagonal of all four quarters perform a micro-AAS, represented by a rectangle in Figure (b,c).
 One micro-AAS consists in exchanging information among the 4 corners of a rectangle, which takes 3 rounds, as follows from Figure (a).
CAPTION: The idea of the quadrant-exchange AAS on WH mesh M(2^{k},2^{k}). (a) The 3-round micro-AAS step between four corners of a rectangle. (b) 2 phases (a 2-round and a 1-round one) of AAS on M(4,4). (c) Phase 1 of QEX AAS, consisting of 4 micro-AASs, on M(8,8)
All-port full-duplex hypercube
In the following, we describe an algorithm for AAS that is both time- and transmission-optimal. The lower bound on the number of rounds is 2^{n-1} (see the table in Section #11.3). This optimality can be achieved only if all n2^{n} hypercube arcs are used in each of the 2^{n-1} rounds and if the data are always routed along shortest paths. For simplicity, assume the matrix-transposition AAS. Hence, consider a matrix A_{N,N}, where N=2^{n}, stored row-wise in Q_{n} so that each node P_{i} owns row i of A_{N,N}, and an element a_{i,j} of A_{N,N}, 0<= i,j<= N-1, is stored in its local memory location j. Each node P_{i}, 0<= i<= N-1, has to exchange an element with every other node P_{k}, i\not=k.
Let \hmd_{n}(i,j) denote the Hamming distance of vertices i,j\in V(Q_{n}). Let (i,j) denote the distributed-memory address of a_{i,j}. The binary string i\XOR j will be referred to as the relative address of a_{i,j}. During the matrix transposition, an element a_{i,j} in P_{i} has to cross, in some yet unknown order, \hmd_{n}(i,j) hypercube arcs to get to its destination memory location i in P_{j}. All the neighbor-to-neighbor exchanges of data across all communication links simultaneously represent one global communication round.
Since the destination and source locations have the same relative address, the basic idea of the AAS algorithm is
to preserve the relative addresses of packets in each intermediate round.
For example, if n=5, i=5, and j=22, then three communication rounds are needed to exchange a_{i,j} and a_{j,i}, since \hmd_{5}(5,22)=3. One possible relative-address-preserving path is (5,22)->(7,20)->(6,21)->(22,5).
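The example can be verified mechanically; this check (an illustration) confirms the Hamming distance and the invariant that every intermediate pair keeps the same relative address:

```python
# Verifying the worked example n = 5, i = 5, j = 22.
i, j = 5, 22
rel = i ^ j                        # relative address: 10011
assert bin(rel).count("1") == 3    # Hamming distance hmd_5(5, 22)

# A relative-address-preserving routing of the exchange (5,22) <-> (22,5):
path = [(5, 22), (7, 20), (6, 21), (22, 5)]
# every pair on the path has the same relative address ...
assert all(a ^ b == rel for a, b in path)
# ... and each hop crosses exactly one hypercube arc (one bit flip).
assert all(bin(a ^ c).count("1") == 1
           for (a, _), (c, _) in zip(path, path[1:]))
```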
In general, we need a scheduling table that says, for every global communication round k in {1,...,2^{n-1}}, which relative addresses r_{k,0},...,r_{k,n-1} are assigned to dimensions 0,...,n-1, respectively.
In other words, in round k, all 2^{n} elements with the same relative address r_{k,i}, i in {0,...,n-1}, will be exchanged simultaneously along all the 2^{n-1} full-duplex links of direction i. Due to the all-port assumption, we can do this for all n directions simultaneously. The schedule table has 2^{n-1} rows (one row per round) and n columns (one column per direction). The elements r_{i,j}, 1<= i<= 2^{n-1}, 0<= j<= n-1, of the schedule table are nonzero n-bit binary strings. Let r_{i,j}[k], 0<= k<= n-1, denote the k-th bit of r_{i,j}. The schedule table must fulfill the following constraints:
 (a)
 for all i,j: r_{i,j}[j]=1. The j-th bit of every relative address in column j is 1 (since column j contains relative addresses of matrix elements that must cross arcs of direction j).
 (b)
 for all i and all j_{1}\not=j_{2}: r_{i,j_{1}}\not=r_{i,j_{2}}. Any relative address appears at most once in each row (since data corresponding to one relative address can be sent in at most one direction in one communication round).
 (c)
 Every relative address r in {1,...,2^{n}-1} must appear exactly once in every column j, 0<= j<= n-1, such that r[j]=1 (since an element must cross exactly once each hypercube direction corresponding to a unity bit in its relative address).
The design of an optimal AAS algorithm in this communication model is therefore reduced to the problem of constructing a schedule table fulfilling conditions (a)-(c). The table can be constructed, for example, in the following way.
Algorithm for constructing the schedule table:
Let m_{i}=2i+1, i=0,...,2^{n-1}-1, represented as an n-bit string. The entries r_{i,j}, 0<= j<= n-1, of row i are constructed from m_{i} as follows:
 (i)
 r_{i,j}=\non_{j+1}(m_{i}) for 0<= j<= n-2, and r_{i,n-1}=m_{i};
 (ii)
 swap bits 0 and j in r_{i,j}.
Lemma
The schedule table constructed by this algorithm fulfills conditions (a)-(c).
Proof
(a) Since m_{i}=2i+1 is odd and step (i) never inverts bit 0, swapping bits 0 and j yields r_{i,j}[j]=1. (b) Any two entries r_{i,j_{1}} and r_{i,j_{2}} of the same row differ in at least one bit, since different bits are inverted in different columns in step (i). To prove condition (c), it is sufficient to prove that r_{i_{1},j}\not=r_{i_{2},j} for all i_{1}\not=i_{2}. But this follows immediately from the fact that all m_{i}=2i+1, i=0,...,2^{n-1}-1, are distinct, and distinctness is preserved by inverting one fixed bit and swapping two fixed bits. Column j thus contains 2^{n-1} distinct addresses, all with bit j equal to 1 by (a); since there are exactly 2^{n-1} such addresses, each appears exactly once.
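The construction and the three conditions can also be checked directly. In the sketch below (the helper names are my own; \non_{k} is read as "invert bit k"), the table is built for n=4 and conditions (a)-(c) are tested exhaustively:

```python
# Building and verifying the AAS schedule table for Q_n (here n = 4).
n = 4
rows = 1 << (n - 1)

def flip(x, k):                     # invert bit k of x  (the \non_k operator)
    return x ^ (1 << k)

def swap_bits(x, a, b):             # swap bits a and b of x
    if ((x >> a) & 1) != ((x >> b) & 1):
        x ^= (1 << a) | (1 << b)
    return x

table = []
for i in range(rows):
    m = 2 * i + 1                   # m_i, an odd n-bit number
    row = [swap_bits(m if j == n - 1 else flip(m, j + 1), 0, j)
           for j in range(n)]
    table.append(row)

# (a) bit j of every entry in column j is 1.
assert all((table[i][j] >> j) & 1 for i in range(rows) for j in range(n))
# (b) no relative address repeats within a row.
assert all(len(set(row)) == n for row in table)
# (c) every r with r[j] = 1 appears exactly once in column j.
for j in range(n):
    col = sorted(table[i][j] for i in range(rows))
    assert col == [r for r in range(1, 1 << n) if (r >> j) & 1]
```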
Table shows a schedule table for n=4.
Round   j=0    j=1    j=2    j=3
  1     0011   0110   1100   1000
  2     0001   0111   1110   1010
  3     0111   0010   1101   1100
  4     0101   0011   1111   1110
  5     1011   1110   0100   1001
  6     1001   1111   0110   1011
  7     1111   1010   0101   1101
  8     1101   1011   0111   1111
CAPTION: Optimal scheduling table for AAS in an all-port full-duplex SF Q_{4}.
We assume that the schedule table is precomputed and stored in the local memories of all nodes. The AAS consists of successive communication rounds i, 1<= i<= 2^{n-1}. In every round i, each node P_{k} sends the number from its memory location r_{i,j}\XOR k along the communication link of direction j, 0<= j<= n-1 (hence to node P_{l}, where l=\non_{j}(k)), and subsequently stores the number received from link j in the opposite direction into the same memory location r_{i,j}\XOR k.
Direct exchange AAS on hypercube
As mentioned in Subsection , if t_{s}<< \mu t_{m}, the standard exchange on Q_{n} with combining of packets is not optimal, and there exists an asymptotically optimal algorithm on the WH hypercube. It is called the direct exchange (DEX) algorithm: every pair of nodes exchanges its two packets directly. The algorithm is based on the fact that an e-cube-routed hypercube can perform any permutation \pi_{j}: x-> x\XOR j, where j is a nonzero n-bit string, contention-free in one round (we proved this fact in Section #10). Thus DEX consists of 2^{n}-1 permutations \pi_{j}, j=1,...,2^{n}-1. Each node executes the following algorithm in a lockstep way.
for i=1,...,2^{n}-1 do
begin
  mate \leftarrow my\_id() \XOR i;
  exchange packets with mate using e-cube routing;
end
where my\_id() is a function that returns the address of the local hypercube node.
CAPTION: AAS on a non-combining WH hypercube
The communication latency of this solution is simply
t_{AAS}(Q_{n},\mu)=(2^{n}-1)(t_{s}+nt_{d}/2+\mu t_{m})
which is only twice as large as the lower bound derived in Subsection .
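The contention-freeness that DEX relies on can be spot-checked by counting arc usage. The sketch below routes the permutation x -> x XOR j with e-cube routing (correcting the lowest dimension first, an assumption on the dimension order) and verifies that no directed arc is shared by two paths:

```python
# Sketch: e-cube routing of x -> x XOR j is contention-free on Q_n.
from collections import Counter

n, j = 4, 0b1011
N = 1 << n

use = Counter()                      # directed arc -> number of paths on it
for x in range(N):
    cur, dst = x, x ^ j
    for b in range(n):               # e-cube: correct bits in fixed order
        if (cur ^ dst) & (1 << b):
            nxt = cur ^ (1 << b)
            use[(cur, nxt)] += 1
            cur = nxt

# Each of the 2^n paths has |j| hops, and no directed arc carries two paths.
assert sum(use.values()) == N * bin(j).count("1")
assert max(use.values()) == 1
```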
Meshes/tori
The hypercube DEX algorithm can, of course, be simulated on meshes, especially if their side lengths are powers of two. If we assign binary addresses to the mesh nodes, the algorithm uses the same permutations as on the hypercube, only with XY routing instead of e-cube routing. The only difference is that XY routing produces link congestion for some permutations, due to the sparsity of mesh links. The maximum congestion in M(z_{1},z_{2}) in any step is max(z_{1},z_{2})/2. Figure shows 6 of the 7 possible permutations on M(2,4). In three of them, the middle horizontal arcs suffer link congestion 2. The exact formula for the communication latency depends on how the communication system handles link congestion.
CAPTION: Direct exchange AAS on M(2,4).
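The congestion claim can be checked by brute force. The sketch below assumes an addressing in which bit 2 selects the row of M(2,4) and bits 0-1 the column, and XY routing that moves along the row first; both conventions are my assumptions for the illustration.

```python
# Sketch: XY-routing link congestion of the DEX permutations on M(2,4).
from collections import Counter

def congestion(j):
    use = Counter()                       # directed mesh arc -> usage count
    for node in range(8):
        r, c = divmod(node, 4)            # current (row, column)
        r2, c2 = divmod(node ^ j, 4)      # destination (row, column)
        step = 1 if c2 > c else -1
        for k in range(c, c2, step):      # X phase: walk along the row
            use[((r, k), (r, k + step))] += 1
        if r != r2:                       # Y phase: at most one vertical hop
            use[((r, c2), (r2, c2))] += 1
    return max(use.values())

# The worst congestion over all 7 permutations is max(z1, z2)/2 = 2.
assert max(congestion(j) for j in range(1, 8)) == 2
```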
Another algorithm is a little more involved. It also uses dimension-order routing, and it achieves an asymptotically optimal number of rounds even though it needs only 1-port capability! It is based on a similar idea as the quadrant-exchange algorithm; however, it works optimally for a larger family of meshes and it is non-combining.
CAPTION: Direct exchange AAS on a cycle.
First, we explain the idea of the algorithm on a 1-D torus. For simplicity, assume T(z) with z=4t.
 Arrange T(z) in the xy-plane as in Figure (a). The X+Y+ quadrant is called quadrant I, and the other three are denoted II, III, and IV anticlockwise.
 Denote the nodes in quadrant I by u_{1},...,u_{t}.
 Denote by u_{i}^{x}, u_{i}^{y}, and u_{i}^{xy} the nodes symmetric to u_{i} with respect to the x axis, the y axis, and the origin, respectively (see Figure (a)).
 For each of the t(t-1) possible ordered pairs of distinct nodes u_{i} and u_{j} in quadrant I, perform the following two contention-free micropermutations (see Figure (b)).
  \pi_{1}(u_{i},u_{j}): {u_{i}->u_{j}, u_{j}->u_{i}^{xy}, u_{i}^{xy}->u_{j}^{xy}, u_{j}^{xy}->u_{i}; u_{i}^{x}->u_{j}^{x}, u_{j}^{x}->u_{i}^{y}, u_{i}^{y}->u_{j}^{y}, u_{j}^{y}->u_{i}^{x}}
  \pi_{2}(u_{i},u_{j}): {u_{i}->u_{j}^{x}, u_{j}^{x}->u_{i}^{xy}, u_{i}^{xy}->u_{j}^{y}, u_{j}^{y}->u_{i}; u_{j}->u_{i}^{y}, u_{i}^{y}->u_{j}^{xy}, u_{j}^{xy}->u_{i}^{x}, u_{i}^{x}->u_{j}}
 Now every possible pair of nodes on the cycle has exchanged packets directly, except for the quadruples u_{i},u_{i}^{x},u_{i}^{y},u_{i}^{xy}. This is performed next.
 For each of the t nodes u_{i} in quadrant I, define u_{j}=u_{(i \mod t)+1} and perform the following two contention-free micropermutations (see Figure (c)).
  \pi_{3}(u_{i}): {u_{i}->u_{i}^{x}, u_{i}^{x}->u_{i}^{xy}, u_{i}^{xy}->u_{i}^{y}, u_{i}^{y}->u_{i}; u_{j}->u_{j}^{y}, u_{j}^{y}->u_{j}^{xy}, u_{j}^{xy}->u_{j}^{x}, u_{j}^{x}->u_{j}}
  \pi_{4}(u_{i}): {u_{i}->u_{i}^{xy}, u_{i}^{xy}->u_{i}, u_{i}^{x}->u_{i}^{y}, u_{i}^{y}->u_{i}^{x}}
Lemma
In T(z) with z=4t, this AAS takes 2t^{2}=z^{2}/8 rounds.
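The micropermutation schedule can be verified exhaustively. The sketch below places node p of T(z) at angle (2p+1)\pi/z, so that quadrant I is p=0,...,t-1 and the symmetric partners are x(u)=z-1-u, y(u)=z/2-1-u, xy(u)=u+z/2; this concrete placement is my assumption, chosen to be consistent with the symmetry definitions. It counts the micropermutation rounds and checks that every ordered pair of nodes receives exactly one direct delivery.

```python
# Sketch: enumerating the micropermutations pi_1..pi_4 of the cycle DEX
# scheme on T(z), z = 4t, and checking round count and pair coverage.
z = 8
t = z // 4
x = lambda u: z - 1 - u           # mirror about the x axis
y = lambda u: z // 2 - 1 - u      # mirror about the y axis
xy = lambda u: u + z // 2         # point reflection through the origin

rounds = []
for a in range(t):                # ordered pairs (u_i, u_j), i != j
    for b in range(t):
        if a == b:
            continue
        rounds.append([(a, b), (b, xy(a)), (xy(a), xy(b)), (xy(b), a),
                       (x(a), x(b)), (x(b), y(a)), (y(a), y(b)),
                       (y(b), x(a))])                       # pi_1
        rounds.append([(a, x(b)), (x(b), xy(a)), (xy(a), y(b)), (y(b), a),
                       (b, y(a)), (y(a), xy(b)), (xy(b), x(a)),
                       (x(a), b)])                          # pi_2
for a in range(t):                # quadruple exchanges
    b = (a + 1) % t
    rounds.append([(a, x(a)), (x(a), xy(a)), (xy(a), y(a)), (y(a), a),
                   (b, y(b)), (y(b), xy(b)), (xy(b), x(b)),
                   (x(b), b)])                              # pi_3
    rounds.append([(a, xy(a)), (xy(a), a), (x(a), y(a)),
                   (y(a), x(a))])                           # pi_4

deliveries = [d for r in rounds for d in r]
assert len(rounds) == 2 * t * t             # 2t^2 = z^2/8 micropermutations
assert sorted(deliveries) == sorted(
    (s, d) for s in range(z) for d in range(z) if s != d)
```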
In higher-dimensional meshes, this can be generalized using the Cartesian product property.
A naive solution is to apply one horizontal and one vertical micropermutation per round, but this decreases efficiency. For M(z,z), where z=4t, up to t horizontal and t vertical disjoint micropermutations can instead be interleaved simultaneously, so that all mesh links are used in each round, as illustrated by Figure .
CAPTION: Interleaving of horizontal and vertical disjoint micropermutations in M(8,8)