Section#11: One-to-all broadcast algorithms
(CS838: Topics in parallel computing, CS1221, Tue, Mar 2, 1999, 8:00-9:15 a.m.)


The contents

  1. Communication models
  2. Performance metrics
  3. An example of lower bound calculations
  4. Store-and-forward one-to-all broadcast
    1. 1-port vs.~all-port model
    2. EREW PRAM
    3. Hypercube
    4. Meshes
    5. Tori
  5. Wormhole switching one-to-all broadcast
    1. 1-port vs.~all-port model
    2. Hypercube
    3. One-port WH meshes
    4. One-port WH tori
    5. All-port WH meshes and tori
In one-to-all broadcast, one designated node, called source, has a packet which must be sent to all the other processors. It is the simplest and most fundamental collective communication operation. Nowadays, it is a standard component of parallel SW libraries and is supported in HW in many commercial parallel computers. OAB is used in many parallel algorithms, such as vector-matrix multiplication, Gaussian elimination and LU factorization, Householder transformations.
Back to the beginning of the page Back to the CS838 class schedule

Communication models

The design of optimal collective communication algorithms must take a number of constraints of the underlying network into consideration. In this text, we will model these constraints using the following parameters:
Back to the beginning of the page Back to the CS838 class schedule

Performance metrics

The efficiency of communication algorithms can be evaluated using several criteria. Their choice depends on the communication model and hardware.
Back to the beginning of the page Back to the CS838 class schedule

An example of lower bound calculations

Lemma

The lower bounds on the time and transmission complexity of four basic collective communication operations in all-port full-duplex SF noncombining Qn in constant-time model are given in the following table.
comm. operation # of rounds # of transmissions
OAB n 2n-1
AAB [(2n-1)/n] 2n(2n-1)
OAS [(2n-1)/n] n2n-1
AAS 2n-1 n22n-1
Proof
  1. SF OAB cannot take less rounds than is the value of the diameter. Since the broadcast packet has to be delivered to 2n-1 distinct nodes, the number of transmissions is at least 2n-1.
  2. During AAB, each node has to receive 2n-1 distinct packets. In the all-port model, it can receive at most n packets in one round. Thus, at least [(2n-1)/n] rounds are needed. The number of transmissions is at least 2n times the number of transmissions of one OAB.
  3. During OAS, the source has to inject the total of 2n-1 distinct packets and it can inject at most n packets in one round. There is n\choose k vertices of Qn in the distance k from the source. The number of transmissions cannot be less than \sumk=1nk{n\choose k}=n2n-1.
  4. AAS consists of 2n OASs running concurrently. So, the number of transmissions is 2n times the number of transmissions in one OAS, i.e., n22n-1. Full-duplex all-port Qn can perform n2n hops in one round. Thus, the minimum number of rounds is ((n/2)22n)/(n2n)=2n-1.
It is noteworthy that there exist optimal hypercube algorithms matching exactly these lower bounds and we will describe some of them in the following four sections.
Back to the beginning of the page Back to the CS838 class schedule

Store-and-forward one-to-all broadcast

Clearly, half-duplex channels are sufficient for OAB, full-duplex channels cannot improve anything.

1-port vs.~all-port model

Any all-port SF network has a trivial optimal OAB with the number of rounds equal to its diameter, called flooding algorithm. It works for hypercubic networks, mesh-based networks, and shuffle-based networks. The source sends the packet to all its neighbors, which make a copy of it and forward it to all its remaining neighbors. Even though the packet is never sent back, some nodes can receive the packet more than once if the graph contains cycles. To avoid the duplication, broadcast algorithms are designed to send the packet only along edges of spanning trees of minimum height, called shortest-path spanning trees or broadcast trees.

On the other hand, optimal OAB algorithms are harder to find for 1-port SF networks. The

trivial lower bound on the number of rounds of OAB in 1-port SF network G is max(diam(G),log |V(G)|),
since in one round, the number of informed nodes can double at most. It can be generalized to the case of a k-port G, the
trivial lower bound on the number of rounds of OAB in k-port SF network G is max(diam(G),logk+1 |V(G)|).
However, the problem to determine the optimal time complexity of OAB (= nontrivial lower bound) in 1-port SF network G is NP-complete. Fortunately, for most SF networks, optimal OAB algorithms are known.

EREW PRAM

CAPTION: EREW PRAM one-to-all broadcast with 8 processors.
Obviously, EREW PRAM is constrained in the same way as an 1-port network. In the first step, the source writes the data into a shared memory cell. The standard optimal algorithm then just doubles the number of copies of the data in the shared memory in every step, as implied by Figure . As a result, the number of informed processors doubles, too. Therefore, the number of steps matches the trivial lower bound [log p].

It is noteworthy that exactly the same doubling algorithm works for 1-port complete network.

Hypercube

Interestingly enough, the hypercube allows exactly the same optimal broadcast algorithm, since its symmetric and binary recursive topology fits perfectly to this communication scheme.

CAPTION: Spanning binomial trees in 1-port and all-port Q4.
The corresponding spanning tree of Qn is called n-level spanning binomial tree, SBTn. In fact, it is isomorphic to the D&C tree of Qn explained in Section #7. Figure illustrates the recursive structure of SBTn: SBT0 is a single node and SBTn is constructed from two SBTn-1's by linking their roots with a new edge and by making either of them the new root. Since the hypercube is both vertex and edge symmetric, we can place the root of a SBT in any hypercube node and use the hypercube dimensions in any order.

The SBT algorithm works optimally for both 1-port and {\em all-port hypercube.}
In Figure , the edge labels are the dimension numbers. Each vertex has a label corresponding to the round number in which it receives the packet. The number of nodes that receive the packet in round i is 2i-1 in the 1-port model, whereas n\choose i in the all-port model (therefore the name binomial tree). Clearly \sumi=1n2i-1=\sumi=1n{n\choose i}=2n-1. Both schemes are transmission optimal.

Meshes

An optimal OAB in a SF 1-D mesh is basically the flooding algorithm:
Send the packet to both directions in pipeline. In 1-port case, send it in one direction, then in the other, starting with the longer-path-ahead one (again the farthest-first strategy)
In n-dimensional meshes, dimension-order approach is necessary to avoid duplication. Every node uses dimensions in the same order, say 1,...,n, so that the communication pattern generates dimension-order spanning trees, which are basically generalizations of hypercube binomial trees. The downward (upward) direction of dimension i will be denoted i- (i+, respectively).

All-port meshes

The source can be any internal node. It will start OAB just by sending the packet to all output channels. Every other node will execute the following node algorithm:
if the packet arrived via channel of direction i, where 0<= i<= n-1,
then forward it in direction i further (if possible) and simultaneously send it
to all neighbors in directions j=i+1,..., n-1 upward and downward (if they exist)

The number of rounds is equal to the diameter at most, since it is again a flooding algorithm, just ``dimension-controlled''.

1-port meshes

The source can be again any internal node. It starts by sending the packet along the first dimension, first to the direction with the longer path ahead, and then it injects the packet sequentially into channels of further directions 2-,2+,...,n-,n+.
Every other node will execute the following node algorithm:
if the packet arrived via channel in dimension i, where 0<= i<= n-1,
then
{ if you are not the last node in direction i, then forward it in direction i further;
for j=i+1,..., n-1 do_sequentially
{ if you are not the first node in dimension j, then forward it in direction j-;
if you are not the last node in dimension j, then forward it in direction j+;
}

Figure depicts such a tree. The numbers within nodes are again the rounds in which the node receives the packet.

CAPTION: Dimension-order OAB tree in 1-port store-and-forward M(3,3,4).
The farthest-first approach provides tight time optimality, except for the case when the first dimension has odd size and the source is exactly in the middle of it. We can conclude that

dimension-order spanning trees give optimal OABs for both 1-port and {\em all-port meshes.}
A general formula for calculating the broadcast time on a SF mesh M(...) is
tOAB(M(...))<=(ts+td+mtm)diam(M(...)).

Tori

Tori are vertex symmetric, so that we can use
the mesh algorithm above with the source sitting in the middle of the mesh.
Without any precautions, we will have duplication of packets in final rounds in each cycle. There are two solutions:
Back to the beginning of the page Back to the CS838 class schedule

Wormhole switching one-to-all broadcast

1-port vs.~all-port model

The time lower bound in one-port WH network is log |V(G)|. The diameter lower bound does not apply, since the communication is distance-insensitive. 1-port meshes and tori can easily achieve this lower bound by simulating the hypercube broadcast scheme. On the other hand, the lower bound for all-port meshes and tori is much harder to achieve, we will describe some known results.

Hypercube

In 1-port Qn, WH routing does not provide any improvement compared to SF, since the SBTn-based SF algorithm with time complexity (ts+td+mtm)n is optimal. Under the all-port assumption, however the WH routing can provide improvement for larger n, since the lower bound on the number of rounds becomes [ n/log(n+1)] (the number of informed nodes can increase n times in every round). The first improvement is so called double tree algorithm (DTA). It is a simple variation of the SBT algorithm which requires [ n/2] rounds for all-port WH Qn. The idea is to send the packet to the node s', complement to the source s, in the first round, in parallel with sending the packet to all neighbors of s not involved in the communication with s'. Then both s and s' build partial SBTs as implied by Figure .

CAPTION: 2-round OAB in all-port WH Q4
DTA is really approximately twice faster on real machines than the standard SBT algorithm, but it is optimal only for n<= 6. An algorithm which is optimal up to a small multiplicative constant for n>= 7 is called Ho-Kao algorithm (HKA) and it is based on the following idea: divide recursively Qn into sufficiently small subcubes of nearly equal size and apply DTA inside these subcubes. HKA assumes e-cube routing with checking the bits left-to-right and uses ascending dimension-simple paths: Given a sequence of dimension numbers 0<= i1 < ... < id<= n-1, d<= n-1, an ascending dimension-simple path is any path u0,...,ud of nodes in Qn such that uj is obtained from uj-1 by complementing bit ij. It follows that all d paths u0-> uj are pairwise node-disjoint if e-cube WH routing is used, so that u0 can send d packets to all these destinations simultaneously. HKA constructs an ascending dimension-simple path u0,...,un such that Qn can be partitioned into n+1 disjoint subcubes Si, of equal or nearly equal size, such that subcube Si contains ui. Figure demonstrates the idea for n=7.

CAPTION: Time optimal OAB algorithm in all-port WH Q7.

One-port WH Meshes

For one-port WH meshes, there exists a simple optimal OAB algorithm. The copies of the packet are disseminated dimension by dimension (cartesian product property), but this time, every node must participate in broadcasting in one dimension the whole time, until all nodes in respective linear paths are informed, since the basic procedure is the recursive doubling instead of trivial SF pipelining. The linear paths are recursively split into halves and in each new half, a new source is chosen. Since WH switching is distance insensitive, this provides always the logarithmic number of rounds. A closer look reveals that it is basically a
simulation of the SF hypercube SBT-based algorithm.
Consider M(z1,...,zn) with source (a1,...,an). The broadcast tree develops in phases in the dimension order 1,...,n. In phase i, all informed nodes form submesh M(*i,ai+1,...,an). In phase i+1, all informed nodes become sources in linear arrays along dimension i+1. Since meshes are not symmetric, the behavior of the algorithm depends on the position of the source. Figure gives the idea which can be easily generalized

CAPTION: A generic OAB scheme for one-port WH meshes

CAPTION: Recursive doubling if the size of the mesh is not a power of two.

One-port WH tori

Optimal-time OAB algorithm uses again recursive doubling, dimension-by-dimension, but due to vertex-symmetry, it is much simpler. The reason is obvious from Figure showing 1D-torus.
OAB in WH T(z) is exactly the same as OAB in WH M(z) with the source in the corner.
The base network latency formula is also simple.
TOAB(T(z))=\sumi=1[log z](ts+[\frac{z}{2i}] td+mtm)=(ts+mtm)[log z] + td(z-1).
Compare this with the complexity for the SF model: TOAB(T(z))=(ts+td+mtm)[\frac{z}{2}].

CAPTION: An optimal OAB in 1-port wormhole 1D-torus T(8) using recursive doubling.

All-port WH meshes and tori

The lower bound on the number of rounds of OAB in n-dimensional WH all-port mesh or torus with N nodes is log2n+1N. To match this lower bound, every node, once involved in the broadcasting, must find in every round from the moment it was informed 2n uninformed nodes and deliver them the packet so that all these paths are arc-disjoint. This turned out to be a difficult problem and it has been addressed by the research community for about last 5 years as a challenging open problem. This effort seems to converge to solutions, the latest solutions provide nearly optimal results. We will mention here just some simpler discoveries made during this research quest.

Recursive tiling approach

One of the earliest attempts to solve the problem was inspired by the theory of recursive tiling of 2-D areas. Figure gives the intuition. Unfortunately, this approach

CAPTION: Tiling of torus T(5k,5k)

Extended dominating set approach

Definition

A set of dominating nodes D in a direct network G is a subset of nodes of G such that every node in G is
  1. either in D
  2. or is a neighbor of at least one node in D
The notion of dominating sets could be useful for SF networks, but it can generalized for WH networks to exploit the distance-insensitivity of routing.

Definition

Consider an all-port WH direct network G with node set V, a routing algorithms R, and two subsets of nodes D2\subset D1\subseteq V. The set D2 is said to be an extended dominating set (EDS) of D1 if and only if there exists a set of link-disjoint paths under R such that every node in D1-D2 is reachable along one such path from some node in D2.
Figure illustrates both concepts.

CAPTION: 5-node dominating set and extended dominating set in T(5,5)
The method works well for meshes M(4k,4k) and the basic idea is to build recursively a hierarchy of extended dominating nodes, where level-i EDNs are informed from all EDNs of levels j < i. Figure illustrates the construction of EDNs in mesh M(16,16) with XY routing. In the first two rounds, the source informs 4 level-3 EDNs. In round 3, these 4 nodes inform 12 level-2 EDNs. In round 4, every level-1 EDN receives one packet from either level-2 or level-3 EDN. After round 4, every uninformed node is dominated by at least one informed node so that in one more round, all nodes are informed.

CAPTION: OAB in M(16,16) based on 3-level EDN hierarchy

Dilated-diagonal-based approach

This algorithm works optimally for 2-D tori T(5k,5k), but it can be generalized and applied to arbitrary 2-D torus, in which case it requires at most 5 rounds more than the lower bound. Assume for simplicity a square torus T(n,n). The algorithm exploits the vertex symmetry of tori. It consists of three phases:
  1. Phase 1: Broadcast the packet into all rows so that each row contains exactly one copy. This can be done in [log5n] rounds by recursive splitting the torus into 5 horizontal strips of approximatively the same width and sending the packet from the source's strip to all other 4 strips in one round along link-disjoint paths using XY routing.
  2. Phase 2: Align the packets to the main diagonal in all rows in parallel. Hence, only one round is needed.
  3. Phase 3: Each node on the main diagonal views the torus as decomposed into 5 diagonal bands of approximatively same width, the main diagonal is in the middle of the first one. In one step, it informs four nodes on middle diagonals of the other four bands so that all the 4n paths are pairwise link-disjoint. This is performed recursively in each diagonal band.
If the torus is not a square, the same idea is applied, except for using so called dilated diagonals.

CAPTION: Dilated-diagonal-based OAB in a square torus.
Figure illustrates the recursive construction in phases 1 and 3.
Back to the beginning of the page Back to the CS838 class schedule