 Back to the beginning of the page Back to the CS838 class schedule

## The embedding problem

The static embedding problem:
• a parallel program/algorithm is described as a set of message-passing processes,
• we know the size and structure of the process graph,
• we have a distributed memory machine with an IN of known size and topology,
• the aim is to map the process graph onto the machine so that the computation is as efficient as possible.

The dynamic embedding problem:
• a parallel program/algorithm is described as a set of message-passing processes which dynamically spawn and terminate,
• we do not know the size and structure of the process graph, it is data dependent, we may know some partial information (such as an upper bound of spawned processes per one parent),
• we have a distributed memory machine with an IN of known size and topology,
• the aim is to map dynamically processes to processing nodes of the machine so that the computation is as efficient as possible.

### Basic definitions and notions

Embedding
of a source graph G (with vertices V(G) and edges E(G)) into a host network H (with nodes V(H) and links E(H)) is a pair of mappings (\varphi,\psi) such that
\varphi: V(G) -> V(H)

\psi: E(G) -> P(H)

where P(H) is the set of all paths of network H. The quality of an embedding is measured by several parameters.

load:
the maximum number of source vertices mapped to one host node.
\load(\varphi,\psi) = max_{v \in V(H)} |{u \in V(G); \varphi(u) = v}|.

The load corresponds to the maximum number of processes which a host processor must execute concurrently. It is desirable to minimize the load.

average load:
defined similarly, as the average number of source vertices per host node. An optimal average load does not necessarily mean optimal efficiency. If there is a significant gap between the maximum and average load, the load is unbalanced and lightly loaded nodes must wait for heavily loaded ones.

uniform load:
the load of all host nodes is the same up to one. This matters much more: it implies both an optimal average and an optimal maximum load. The value of a uniform load is the reciprocal of the expansion.

expansion:
the ratio of the host network size (= # of processing nodes) and the source graph size (= # of processes)
\vexp(\varphi,\psi) = |V(H)| / |V(G)|.

The optimum value is 1, but this is not always achievable, since some networks are only partially scalable (they exist only in certain sizes). The higher the expansion, the more processing nodes idle during the parallel computation. If \vexp(\varphi,\psi) < 1, then \load(\varphi,\psi) > 1.

dilation:
In general, embedding stretches source edges to paths in the host network. The dilation of an embedding is the maximum length of such paths taken over all source edges.
\dil(\varphi,\psi) = max_{e \in E(G)} len(\psi(e)).

If the images of edges are always shortest paths (they need not be), then the dilation is the maximum distance in the host network between images of adjacent source vertices. The dilation corresponds to the number of intermediate host routers along the route of a message sent between neighboring source processes. In contrast to the load, a large maximum dilation does not necessarily prevent the embedding from providing good performance of the mapped algorithm, since what really matters is

average dilation:
The impact of both maximum and average dilation on the communication performance depends on whether the routing and switching is distance sensitive or distance insensitive. Distance sensitive routing requires small dilation due to latency. But distance insensitive routing can also profit from small dilation: It consumes less links in the host network and therefore decreases the potential communication contention or even deadlocks. This motivates another important parameter.

link congestion:
the maximum number of images of source edges passing through a host link.
\ecng(\varphi,\psi) = max_{e2 \in E(H)} |{e1 \in E(G); e2 \subseteq \psi(e1)}|.

It represents the maximum number of interprocess communication source channels mapped on one physical host link. Link congestion should be low to minimize link or router buffers contention.

node congestion:
the maximum number of images of source edges passing through a host node.
\vcng(\varphi,\psi) = max_{u \in V(H)} |{e \in E(G); u \in \psi(e)}|.

It represents the maximum number of interprocess channels passing through one host router and its impact depends very much on the disjointedness of these paths. If the router can perform as a switch allowing multiple link-disjoint paths concurrently, then even high node congestion does not necessarily mean high latency.

average node or link congestion:
average congestion per host node/link. Again, this parameter is usually more important than the maximum values.
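For a concrete embedding, all of the parameters above are easy to compute directly from their definitions. The following is a minimal illustrative sketch (not from the course notes; the representation is our assumption): the source graph is an edge list, \varphi is a dict, and \psi maps each source edge to a path given as a list of host nodes.

```python
from collections import Counter

def embedding_measures(V_G, E_G, V_H, phi, psi):
    """Compute load, dilation, link congestion, and expansion of (phi, psi)."""
    # load: maximum number of source vertices mapped to one host node
    load = max(Counter(phi[u] for u in V_G).values())
    # dilation: longest image path, measured in host links
    dil = max(len(psi[e]) - 1 for e in E_G)
    # link congestion: how many image paths cross each (undirected) host link
    link_use = Counter()
    for e in E_G:
        path = psi[e]
        for a, b in zip(path, path[1:]):
            link_use[frozenset((a, b))] += 1
    ecng = max(link_use.values())
    vexp = len(V_H) / len(V_G)
    return load, dil, ecng, vexp
```

For example, a 3-vertex path embedded identically into a 4-node host gives load 1, dilation 1, link congestion 1, and expansion 4/3.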

quasiisometric networks
G and H: G can be embedded into H and vice versa with O(1) embedding measures.

emulation:
H emulates G with slowdown h if one step of computation on G can be simulated in O(h) steps on H.

computationally equivalent networks
G and H: G can emulate H with constant slowdown and vice versa.

### Remarks

• Obviously, if there is an embedding of G into H with \dil=\load=1, then G is a subgraph of H and if \vexp=1, then G is a spanning subgraph of H.
• Any two quasiisometric networks are computationally equivalent, but not vice versa.
• In general, embeddings give useful hints on lower and upper bounds on potential performance of algorithms mapped to parallel machines with static IN topology. What embeddings cannot catch is the dynamic behavior of an application on a host network. An embedding with high dilation can provide a good mapping if processes incident to the most dilated source edges communicate less frequently than those incident to little dilated edges. Even high link congestion does not necessarily mean communication bottlenecks if the communication requests interleave in time.

### Lower bounds

Some simple lower bounds on embedding measures follow just from graph theoretical properties of topologies.

Lemma

1. The lower bound on the load is max(1,1/\vexp).
2. If |V(G)|=|V(H)| and \load=1, then the lower bound on the dilation is diam(H)/diam(G).
3. If the average dilation is k, then the lower bound on the link congestion is k|E(G)|/|E(H)|.
Proof
1. Trivially follows from what we have said about uniform load.
2. This is the so-called diameter argument. Since both graphs are of the same size, every host node is loaded. Consider two vertices u, v \in V(G) mapped to host nodes at distance diam(H) from each other. Any shortest path P(u,v) in G, of length at most diam(G), is stretched to length at least diam(H) in H. The dilation is minimized when all the edges in P(u,v) are dilated equally.
3. The argument follows just from counting edges of the image of G in H and from assumption of an ideal uniform distribution of source paths over the host network.

## Embeddings into the hypercube

The nice property of the hypercube is that it can simulate very efficiently almost every known IN topology. This was the main reason why the hypercube multiprocessors dominated the world of commercial massively parallel computers several years ago. Due to the symmetry of the hypercube, an embedding of graph G into Qn can be described by a vertex or edge labeling of G.

vertex labeling:
a labeling of the vertices of G with n-bit binary addresses. If the labels of two adjacent vertices of G differ in more than one bit, the edge joining them is dilated in Qn.

edge labeling:
a labeling of the edges of G with dimension numbers 0,1,..,n-1.

optimal hypercube:
If 1<=\vexp < 2, then Qn is the optimal hypercube for embedding G with unit load.

### Embeddings of paths and cycles

Proposition

Let u, v \in V(Qn) with Hamming distance \varrho(u,v). Let \delta(u,v) be the set of dimensions in which u and v differ. Let P(u,v) be a path of length m in Qn. The edge labeling of P(u,v) is described by an n-tuple P=[p_0,p_1,...,p_{n-1}], where p_j is the number of edges of P(u,v) in dimension j, so that \sum_{j=0}^{n-1} p_j = m. Then:
• |\delta(u,v)| = \varrho(u,v),
• each dimension in \delta(u,v) appears an odd number of times in P and each of the remaining n-\varrho(u,v) dimensions appears an even number of times in P,
• m = \varrho(u,v) (mod 2),
• if u=v, then m is even (there are no odd-length cycles in hypercubes),
• if P(u,v) is a hamiltonian path of Qn, then \varrho(u,v) is odd.
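These parity facts follow because crossing a dimension twice cancels out. A tiny illustrative check (function names are ours, not from the notes):

```python
def path_endpoint(u, dims):
    """Follow a hypercube path from node u along edges with the given
    dimension labels: crossing dimension d flips bit d of the address."""
    for d in dims:
        u ^= 1 << d
    return u

def hamming(u, v):
    """Hamming distance between two node addresses."""
    return bin(u ^ v).count("1")
```

A path of length m from u to its endpoint v then satisfies m = \varrho(u,v) (mod 2), since every dimension crossed an even number of times contributes nothing to the Hamming distance.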

Gray sequence:
the vertex labeling of a path in Qn

n-bit Gray code:
the vertex labeling of a hamiltonian path/cycle of Qn
binary reflected Gray code (BRGC):
the standard Gray code, defined recursively (see Figure). Let Gn denote the n-bit BRGC. Then
• G1 = {0, 1},
• Gn = {0Gn-1, 1(Gn-1)^R}, where
• 0Gi and 1Gi denote prefixing each element of Gi with bit 0 and 1, respectively,
• (Gi)^R is the sequence Gi in reversed order.

CAPTION: Recursive construction of the binary reflected Gray code.

• Algebraically, Gn can be computed as follows:

Proposition

Let b = bn-1... b0 be an n-bit binary number. Then its encoding in Gn is Gn(b) = gn-1... g0, where
• gn-1 = bn-1,
• gi = bi+1 \XOR bi for i = n-2, ..., 0.
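The recursive and algebraic definitions can be checked against each other in a few lines (an illustrative sketch, not course code). On integers, the rule gi = bi+1 \XOR bi is exactly `b ^ (b >> 1)`:

```python
def brgc(n):
    """n-bit binary reflected Gray code Gn as a list of bit strings."""
    if n == 1:
        return ["0", "1"]
    prev = brgc(n - 1)
    # Gn = { 0 Gn-1, 1 (Gn-1 reversed) }
    return ["0" + s for s in prev] + ["1" + s for s in reversed(prev)]

def gray_encode(b, n):
    """Algebraic form: bit i of the code is b_{i+1} XOR b_i."""
    return format(b ^ (b >> 1), "0{}b".format(n))
```

Both definitions agree, and consecutive codewords differ in exactly one bit, which is what makes Gn a vertex labeling of a hamiltonian path of Qn.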
### Embeddings of trees

#### Static trees

Despite some structural differences between hypercubes (rich in cycles and regular) and trees (acyclic and nonregular), both these topologies are bipartite and hierarchically recursive, and this allows efficient tree embeddings into optimal hypercubes in many cases. Let us review some facts and results:
• CBTn cannot be a subgraph of Qn+1 even though |V(CBTn)| < |V(Qn+1)| (the argument is just by counting black and white vertices in a 2-coloring). CBTn is a subgraph of Qn+2; the proof, by constructing such an embedding, is nontrivial (Figure(a) shows an embedding of CBT2 into Q4).
• CBTn can be embedded in Qn+1 with \dil=1 and \load=2 (see the cycle with edge labeling 0-2-0-2 in Figure(b), node u has load 2). Then 2n-2+2n-4+... hypercube nodes will have load 2 and 2n-1-2n-3-2n-5-... hypercube nodes will have load 0.

CAPTION: Embedding of CBT2 into a hypercube with \dil=1.

• The best embedding of CBTn into Qn+1 is with \dil=2 and \load=\ecng=1. It is based on the following trick: we add another root to CBTn. This changes the tree into a balanced (!) bipartite graph CBT'n with 2n+1 vertices. Then it can be shown by an elegant recursive construction, depicted on Figure, that CBT'n is a spanning tree of Qn+1.

CAPTION: Recursive embedding of CBTn into Qn+1 with \dil=2.
The case of n=1 is trivial (see Figure(a)). By induction, we can construct separately two identical CBT'n-1 in two halves of Qn+1 and by vertex symmetry of the hypercube, we can rotate and translate one subcube as shown on Figure(b). The transformation into CBT'n is then straightforward (see Figure(c)). Note that all tree edges except exactly one have \dil=1. This embedding method has a nice algebraic formulation.

CAPTION: Examples of embeddings of complete binary trees into (a) Q3 and (b) Q4.

• Any balanced one-legged caterpillar of size 2n is a spanning tree of Qn. A one-legged caterpillar is a binary tree whose degree-3 vertices form a path, called the spine, with all the other vertices adjacent to the spine. The Figure gives an example.

CAPTION: An 8-vertex balanced caterpillar as a subgraph of Q3.

• There is a conjecture that any balanced binary tree with 2n vertices is a spanning subgraph of Qn.
• The problem of embedding an arbitrary tree into Qn, where n is given, or into the optimal hypercube, with \dil=\ecng=1 is NP-complete.
• There are recursive heuristics for embedding arbitrary binary trees into optimal hypercubes. The best known algorithm \comment{\cite{heun}} achieves \load=1, \dil=8, and constant edge and node congestions.

#### Dynamic trees

The hypercube can efficiently simulate even dynamically growing trees (for example, branch-and-bound algorithms). The embedding algorithm has to make decisions based on some local partial knowledge of the tree. A process which spawns children should use some heuristic to distribute them onto processors so that the processor load is likely to be balanced across the hypercube. The following algorithm was shown to provide satisfactory results for dynamic binary trees. It uses randomization based on flipping a coin.
• The root of the tree is placed into an arbitrary hypercube node.
• Each embedded tree vertex x
• knows the host hypercube dimension n>= 1,
• knows its hypercube labeling \varphi(x),
• knows its level number in the tree l(x),
• computes its track number t(x)=l(x)\;(\mod n), represented as an n-bit number with a single 1 at position t(x),
• has random generator of its flip bit \fb(x).
• If a tree vertex x, which has been a tree leaf embedded into hypercube node \varphi(x), becomes a parent of one or two children, it generates its flip bit and performs the following steps:
if \fb(x)=0 then
embed left son (if it exists) back into \varphi(x);
embed right son (if it exists) into \varphi(x)\XOR t(x);
else
embed left son (if it exists) into \varphi(x)\XOR t(x);
embed right son (if it exists) back into \varphi(x).
This simple approach gives surprisingly good results. For an M-vertex binary tree and an N-node hypercube, it achieves \dil=1 and \load=O(M/N+log N) with high probability.
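One spawning step of this randomized scheme can be sketched as follows (illustrative code, not from the notes; node labels are integers):

```python
import random

def place_children(phi_x, level, n, rng=random):
    """A parent embedded at hypercube node phi_x on tree level `level`
    places its children: one stays local, the other crosses the dimension
    given by the track number t(x) = level mod n; a coin flip decides
    which child goes where."""
    t = 1 << (level % n)              # track number as a one-hot bitmask
    if rng.random() < 0.5:            # flip bit = 0
        return phi_x, phi_x ^ t       # left stays, right crosses
    else:                             # flip bit = 1
        return phi_x ^ t, phi_x       # left crosses, right stays
```

Either way, each child lands on phi_x itself or on a neighbor of phi_x, so every tree edge has dilation at most 1, matching the \dil=1 bound quoted above.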

#### Simulation of binary Divide&Conquer computation on a hypercube

Many problems lead to the following binary parallel n-level Divide&Conquer (D&C) computation:
• the root process splits the problem (input data) into two halves and passes them to its two children,
• the children do recursively the same,
• at level n, the granularity of subproblems is small enough for them to be solved by leaf processes,
• results are recursively moved back to the root which combines them into the final result.
We may use a standard embedding of CBTn into Qn+1 with \load=1 (see Subsection \ref{StaticTreestoQ}) to simulate such a computation on Qn+1, but it would be inefficient, since one half of the hypercube nodes, those simulating the internal vertices of CBTn, would idle most of the time. However, there is a trivial method to simulate such a computation optimally on Qn.

CAPTION: Simulation of 3-level D&C computation on Q3.

• the root node splits the problem (input data) into two halves,
• it hands over one half to its neighbor in dimension 0 and retains the other half,
• both these two nodes are active and do the same with their halves using dimension 1,
• this is repeated for dimensions i=2,...,n-1,
• all the nodes of Qn become leaves of CBTn and compute the leaf subproblems,
• results are folded back in the reverse order.
This corresponds to an embedding of CBTn into Qn as shown on Figure(a). Left children are mapped onto the same hypercube node as their parents. The embedded tree forms the so-called binomial spanning tree of the hypercube (see Figure(b)), which we will see again in Section 10 of this course.
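The split phase can be traced in a few lines (an illustrative sketch): starting from node 0, each step doubles the set of active nodes along the next dimension, and after n steps every node of Qn holds one leaf subproblem.

```python
def dnc_split(n):
    """Return the set of active hypercube nodes after each split step:
    at step i, every active node keeps one half of its subproblem and
    hands the other half to its neighbour in dimension i."""
    rounds = [{0}]                        # the root subproblem sits at node 0
    for i in range(n):
        nxt = set()
        for x in rounds[-1]:
            nxt.add(x)                    # half retained locally
            nxt.add(x ^ (1 << i))         # half sent across dimension i
        rounds.append(nxt)
    return rounds
```

The tree of sends traced here is exactly the binomial spanning tree mentioned above: node x at step i is the parent of x XOR 2^i.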

### Embeddings of meshes-of-trees

It is known that MTn is even a subgraph of Q2n+2 (which is its optimal hypercube), but we will only explain here an easy way to embed MTn into Q2n+2 using the results above. We just use these three facts:
• Q2n+2 = Qn+1 × Qn+1,
• MTn \subset CBTn × CBTn,
• CBTn can be embedded into Qn+1 with \dil=2 and \load=\ecng=1 (see Subsection \ref{StaticTreestoQ}).
This gives immediately an embedding of MTn into Q2n+2 with \dil=2 and \load=\ecng=1:
• embed CBTn into Qn+1 with \dil=2 and \load=\ecng=1,
• apply cartesian product to get an embedding of CBTn× CBTn into Q2n+2 with \dil=2 and \load=\ecng=1,
• convert CBTn× CBTn into MTn by removing superfluous edges.

CAPTION: MT2 embedded in Q6 with \load=\ecng=1 and \dil=2.

### Embeddings of meshes

Again, the basic approach takes advantage of the fact that the hypercube and the meshes are defined recursively as cartesian products of lower-dimensional hypercubes and meshes, respectively.
• Linear array M(z) is a subgraph of its optimal hypercube Qn using any Gray code, where n=[log z].
• M(2n1, 2n2,..,2nk), ni>=1, is a spanning subgraph of Qn, n=n1+...+nk. The construction is trivial just by embedding each M(zi) into corresponding Qni, where ni=[log zi], and by applying cartesian product to the results. This can be generalized:
• If [log z1]+...+[log zk]=[log(z1... zk)], then M(z1,.., zk) is a subgraph of its optimal hypercube Qn, where n=[log(z1... zk)]. Algebraically, this mapping is computed as follows:
A vertex (x1,...,xk) of M(z1,...,zk), 0<= xi<= zi-1, is mapped into Qn to node
Gl1(\binl1(x1))\circ Gl2(\binl2(x2))\circ... \circ Glk(\binlk(xk)),

where
• \binm(x) denotes m-bit binary representation of integer x < 2m,
• li=[log zi],
• Gl is BRGC encoding on l bits, defined on Page,
• \circ is string concatenation.
• If [log z1]+...+[log zk] > [log(z1... zk)], this "cartesian decomposition" method gives embeddings with \vexp > 2, thus not optimal. To achieve optimality, more sophisticated methods must be used. We state just some results here.
• Any 2-D mesh M(a,b) such that [log a]+[log b]=[log(ab)]+1 can be embedded into its optimal hypercube Q[log(ab)] with \load=1 and \dil=\ecng=\vcng=2.
• A similar best known algorithm for 3-D meshes gives embedding with dilation 5.
• Embeddings of higher-dimensional meshes into optimal hypercubes with load one and minimal dilation are still an open problem. The best known result is that an n-dimensional mesh can be embedded with \dil=4n+1.
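The Gray-code concatenation mapping described above is short to write down. This is an illustrative sketch under the assumption [log z1]+...+[log zk] = [log(z1...zk)], so the target really is the optimal hypercube:

```python
from math import ceil, log2

def mesh_to_hypercube(x, dims):
    """Map mesh vertex x = (x1,...,xk) of M(z1,...,zk) to a hypercube node
    label by concatenating the BRGC encodings of the coordinates."""
    label = ""
    for xi, zi in zip(x, dims):
        li = max(1, ceil(log2(zi)))                         # bits per axis
        label += format(xi ^ (xi >> 1), "0{}b".format(li))  # Gray encode
    return label
```

Mesh neighbours differ in one coordinate by 1, their Gray codes differ in one bit, and the other coordinate fields are unchanged, so every mesh edge has dilation 1.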

### Embeddings of hypercubic networks

• If n = 2^k for some integer k, then CCCn is a spanning subgraph of its optimal hypercube Qn+k (each cycle of CCCn becomes a hamiltonian cycle of the corresponding Qk in Qn+k = Qn × Qk).
• If n is even but not a power of two, then CCCn is a subgraph of its optimal hypercube Qn+[log n].
• If n is odd, then CCCn can be embedded into its optimal hypercube Qn+[log n] with \dil=2 and \load=\ecng=\vcng=1 (since odd-length cycles have dilation at least two).
• In Subsection \ref{wBFtoCCC}, we show that this implies immediately an embedding of wBFn into optimal hypercube with constant dilation and link congestion.

CAPTION: oBFn\subset Qn+[log (n+1)].

• The optimal hypercube for oBFn is Qn+[log (n+1)] and oBFn is its subgraph. Since both the hypercube and the ordinary butterfly are hierarchically recursive, the construction is again nicely recursive. Figure(a) shows, just by redrawing the graph topology, that oBF2\subset Q4 and Figure(b) shows the induction step.

### Embeddings of other graphs

• Given graph G and integers k and n, the problem of embedding G into Qn with dilation k is known to be NP-complete.
• Embeddings of de Bruijn graphs with constant dilation and expansion into the hypercube are not known.

## Embeddings into meshes and tori

2-D or 3-D meshes/tori are currently the most commonly used IN topologies, therefore methods how to embed other networks into them or how to simulate them on them are of high practical interest.

Lemma

Given a graph G and integers k and n, the problem of embedding G into an n-dimensional mesh with \dil<= k is NP-complete\comment{\cite{bhattcosmakis}}. This holds even for k=1, n=2, and G a binary tree.
On the other hand, problems of embeddings into 2-D meshes are very close to problems of designing VLSI layouts and many solutions/approaches are known\comment{\cite{ullman}}.

### Embeddings among meshes and tori

Lemma

Meshes and tori are computationally equivalent, since they are quasiisometric.
Proof
Given integers z1,...,zn, let M=M(z1,...,zn) and T=T(z1,...,zn).
1. M is a subgraph of T so that T can simulate M with no slowdown.
2. On the other hand, T can be embedded into M with \load=1 and \dil=\ecng=2.
• Decompose M into cartesian product M(z1)× ...× M(zn) and T into T(z1)× ...× T(zn).
• Embed each cycle T(zi) into M(zi) with \load=1 and \dil=\ecng=2 as shown on Figure(a) for z1=5.
• The embedding of T into M is composed by using the cartesian product. Figure(b) shows the 2-dimensional case.

CAPTION: Optimal embedding of tori into meshes with \load=1 and \dil=\ecng=2.
If the mesh and torus do not have the same dimensionality or side lengths, we first embed tori into tori or meshes into meshes and then apply this solution.

### Embeddings of tori into tori

In the case of general higher-dimensional rectangular and cube tori, the problem of optimal embeddings is hard and open. Optimal solutions are known only for special low-dimensional cases. We will mention one such result.

Lemma

Let z1\not=z2. Then 2-D torus T(z1,z2) can be embedded into cycle T(z1z2) with \load=1 and with optimal \dil=\ecng=min(z1,z2).
Proof
The embedding is simple and straightforward: it is based on one-to-one vertex mapping of T(z1,z2) which is either row-major if z1>= z2 or column-major otherwise. See Figure for an example.

CAPTION: Optimal embedding of 2-D torus T(3,4) into cycle T(12) with \dil=3=min(3,4) based on column-major vertex mapping. Horizontal edges of T(3,4) have \dil=3 and vertical wrap-around edges have \dil=2.
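The mapping in the proof is one line per vertex. This illustrative sketch also measures the resulting dilation along the cycle:

```python
def torus_to_cycle(i, j, z1, z2):
    """Position of torus vertex (i, j) on the cycle T(z1*z2): row-major
    if z1 >= z2, column-major otherwise (as in the proof above)."""
    return i * z2 + j if z1 >= z2 else j * z1 + i

def cycle_distance(a, b, n):
    """Distance between positions a and b on an n-vertex cycle."""
    d = abs(a - b) % n
    return min(d, n - d)
```

For T(3,4), the maximum over all torus edges of the cycle distance between the images of their endpoints comes out to min(3,4) = 3, matching the lemma.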

### Embeddings of meshes into meshes

Here we will explain some methods by which meshes can simulate each other if they are of the same size but differ in the number of dimensions and/or the lengths of sides. It is again a difficult problem in general; we will discuss only several special cases for which solutions are known. A mesh with unequal side lengths is called rectangular here. The notation for cube meshes will indicate the dimensionality with a subscript at M. For example, Md(w,...,w) denotes a d-dimensional cube mesh.

#### Embeddings of cube meshes into rectangular meshes

Theorem

Let n >= 2 and let z1 >= ... >= zn >= 2 be integers. Let N = \prod_{i=1}^{n} zi, w = N^{1/n}, R = M(z1,...,zn), and S = Mn(w,...,w). Without loss of generality, assume that w is an integer. Then
R can simulate S with slowdown max_i zi / w = z1 / w.
Proof
Inductive by showing that there exists an embedding of S into R with \dil=\ecng=O((maxi zi)/w) and constant load.
(i) Induction basis:
Let n=2 and z1 > z2. Then S=M(w,w) can be embedded into R=M(z1,z2) with \dil=[z1/w], \load=2, and \ecng=O([z1/w]). We show here one possible solution, but there are others with similar properties. Columns of S are placed successively into R, left-to-right along the first dimension, beside each other without overlapping. Each of them starts in the first row of R and snakes vertically through R rightward. See Figure. Since some nodes of R are wasted, the last column of R is reached before all columns of S are embedded, but in that case the snaking simply bounces back mirror-like. Some nodes of R will have \load=2 and some \load=0. The width of one snake in R is [w/z2]=[z1/w]=[max(z1,z2)/w]; this is the dilation of horizontal edges of S, and the link congestion of horizontal links in R is 1+z1/w.

CAPTION: Embedding of a 2-D square mesh into a 2-D rectangular mesh.

(ii) Auxiliary lemma:
\

Lemma

M(z1,...,zn-2,zn-1zn) is a subgraph of M(z1,...,zn-1,zn).
Proof
We use again a simple snake-like embedding. M(zn-1zn) can be snaked through M(zn-1, zn) with unit measures as shown on Figure and the general case follows from the cartesian-product property.

CAPTION: Snaking M(zn-1zn) through 2-dimensional M(zn-1,zn).
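The snake-like ordering used in the lemma is easy to generate (an illustrative sketch): walk the rows of M(z1, z2) alternately left-to-right and right-to-left.

```python
def snake_order(z1, z2):
    """Positions of the vertices of the linear array M(z1*z2) when snaked
    through the 2-D mesh M(z1, z2); consecutive array vertices land on
    adjacent mesh nodes, so load, dilation, and congestion are all 1."""
    order = []
    for r in range(z1):
        cols = range(z2) if r % 2 == 0 else range(z2 - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order
```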

(iii) Induction step:
Decompose R into w submeshes Ri=M([z1/w],z2,...,zn), i=1,...,w, and S into w submeshes Si=Mn(1,w,...,w). The embedding of S into R is composed of one-to-one embeddings of Si into Ri along the direction of z1 in R (see Figure). Since the width of Ri along the first dimension is [z1/w], the edges in the first dimension of S have dilation [z1/w]. Individual embeddings of Si into Ri for i=1,...,w are by induction. Since each Si is isomorphic to an (n-1)-dimensional cube mesh Mn-1(w,...,w), we can embed it by induction into the (n-1)-dimensional mesh M([z1zn/w],z2,...,zn-1) with dilation
[max([z1zn/w],z2)/w] <= [z1/w],

\ecng=O(z1/w), and \load=2. By the Lemma, the (n-1)-dimensional mesh M([z1zn/w],z2,...,zn-1) is a subgraph of the n-dimensional mesh Ri=M([z1/w],z2,...,zn). It follows that each Si can be embedded into Ri with \dil=[z1/w], \ecng=O(z1/w), and \load=2.

CAPTION: Recursive embedding of n-dimensional S into R.

Corollary

S can be simulated by the (n-k)-dimensional mesh Mn-k(N^{1/(n-k)},...,N^{1/(n-k)}) with slowdown N^{k/(n(n-k))}.
Proof
By the Theorem, S can be embedded into R = Mn(N^{1/(n-k)},...,N^{1/(n-k)},1,...,1) with dilation
N^{1/(n-k)} / N^{1/n} = N^{k/(n(n-k))}

and R is isomorphic to the (n-k)-dimensional mesh Mn-k(N^{1/(n-k)},...,N^{1/(n-k)}).

#### Embeddings of rectangular meshes into cube meshes

Theorem

Let R and S be meshes defined as in Theorem. Then
R can be embedded into S with constant dilation.
Proof

(i) Induction basis:
Let n=2. Then R=M(z1,z2) can be embedded into S=M(w,w) with \dil=1 and \load=\ecng=2 by a trivial snaking combined with perpendicular overlap bending at the vertical boundaries of S, as shown on Figure. Note that the nodes of S close to its vertical boundaries have \load=0 or \load=2.

CAPTION: Embedding of M(z1,z2) into M(w,w) with \dil=1.

(ii) Induction step:
Let R' = M(z2,...,zn), S' = Mn-1(w,...,w), r = |V(R')|, and s = |V(S')|. Since z1 > w, it follows that r = N/z1 < s = N/w. Thus s/r = z1/w > 1. By definition, R = M(z1) × R' and S = M(w) × S'. We split the (n-1)-dimensional mesh S' into [s/r] cube submeshes Si' = Mn-1(r^{1/(n-1)},...,r^{1/(n-1)}) (Figure(c)). R can be split along the first dimension into z1 (n-1)-dimensional submeshes Ri, each isomorphic to R', i.e., of size r (Figure(a)).

CAPTION: Embedding of an n-dimensional rectangular mesh into an n-dimensional cube mesh with \dil=1.
By induction, R1 can be embedded into S1' with unit dilation. The rest of mesh R is reshaped in the same way, so that R can be viewed as a mesh Mn(z1, r^{1/(n-1)},...,r^{1/(n-1)}) (Figure(b)). This folded mesh is then snaked through S in the way depicted in Figure(c).

#### Embeddings among 2-D rectangular meshes

A practically important problem is embedding of one 2-D mesh into another one of approximately the same size with \load=1. There exist some partial solutions\comment{\cite{ellis}}. We will not describe the methods here due to the lack of space. They are usually based on small primitive tiles from which embeddings of larger meshes are constructed by composing, squeezing, and folding.

Lemma

Consider M(a,b) and M(a',b') such that a' < a <= b and b' is the minimum integer satisfying a'b' >= ab.
1. If a/a' <= 2, then M(a,b) can be embedded into M(a',b') with \dil<=2.
2. If a/a' <= 3, then M(a,b) can be embedded into M(a',b') with \dil<=3.
3. The first result is optimal, since the lower bounds on dilation and congestion are 2.

CAPTION: Embedding of M(2,9) into M(3,6) with \dil=2 and \ecng=3.

### Simulation of parallel binary D&C computation on 2-D meshes

Lemma

CBT2m can be embedded into M(2m,2m) with bidirectional links so that
• \load=2: Each mesh node is loaded with exactly one leaf vertex and (except one) exactly one internal vertex
• distinct tree edges are mapped on arc-disjoint links in the mesh
• \dil=2m+O(m)
Proof
The embedding is again nicely recursive. The basic idea follows from the Figure. The induction base is in parts (a)-(c). Part (c) shows CBT4, which is embedded in M(22,22) using either pattern A or B. Note that in each pattern, one node is loaded with only one leaf and can therefore become a new root when building larger trees from these building blocks. Also, there are some free arcs which can be used for joining subtrees to new roots. Part (d) illustrates the induction step.

CAPTION: Simulation of D&C computation on a square mesh.

### Embeddings of linear arrays and cycles into high-dimensional meshes and tori

Hamiltonian paths/cycles:
We have seen in Section 5 that
• any T(z1,...,zn) has a hamiltonian cycle,
• any M(z1,...,zn) has a hamiltonian path,
• M(z1,...,zn) has a hamiltonian cycle iff at least one zi is even.
Recursive patterns:
Cube meshes Mn(k,...,k), where k = 2^i for some integer i, have hamiltonian paths/cycles with a nicely recursive structure. Let us demonstrate, for the 2-D mesh Mi = M(2^i,2^i), one solution which is suitable, for example, for mesh-of-trees or pyramidal meshes.
Induction base:
M1 is trivial (see Figure(a)).
Induction step:
1. split Mi into four quarters Mi-1,
2. construct hamiltonian path recursively in each of them, so that the upper two patterns are rotated by 90 degrees as shown in Figure(b),
3. join the terminal vertices of these paths as shown in the same figure.
Peano curves:
If large dilation is not an issue (for example, in the case of distance insensitive routing), meshes can be traversed in a fractal-like way. Very useful are the so-called Peano curves (see Figure(c)). They have dilation \dil=\Theta(\sqrt{N}), but their fractal structure does not require rotation, just translation.

CAPTION: Recursive numbering schemes for square 2-D meshes.

## Embedding linear arrays/cycles into any network

Linear arrays and cycles, as the most trivial topologies, can be embedded with constant measures into any network; this is a classical result by Sekanina\comment{\cite{sekanina}}.

Lemma

An N-vertex cycle C can be embedded into any N-node network G with \load=1, \dil<= 3, and \ecng=2.
Proof
Constructive:
• construct a spanning tree TG of G,
• partition its nodes into two classes V0 = nodes at even-numbered levels of TG and V1 = nodes at odd-numbered levels of TG,
• embed C into G by traversing TG in depth-first order and placing the vertices of C into nodes of TG using the following rule: place the next vertex of C into the currently visited node x of TG when either
• x is in V1 and this is our first visit of x, or
• x is in V0 and this is our last visit of x.
Each node of TG will be loaded exactly once. The proof that \dil<= 3 follows from the recursive structure of any spanning tree.

CAPTION: Embedding an N-vertex cycle into N-node network G with \dil=3 and \load=1.
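The traversal rule can be sketched directly (illustrative code; the spanning tree is given as an adjacency dict): odd-level nodes receive a cycle vertex on the first DFS visit, even-level nodes on the last.

```python
def sekanina_cycle(tree, root):
    """Order the nodes of a spanning tree into a cycle: emit a node on its
    first DFS visit if its level is odd, on its last visit if even."""
    order = []
    def dfs(x, parent, level):
        if level % 2 == 1:
            order.append(x)            # first visit, odd level
        for y in tree[x]:
            if y != parent:
                dfs(y, x, level + 1)
        if level % 2 == 0:
            order.append(x)            # last visit, even level
    dfs(root, None, 0)
    return order
```

On any tree, every node is emitted exactly once, which is the \load=1 claim; consecutive nodes in the order are within tree distance 3, which gives \dil<= 3.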

## Embeddings into hypercubic topologies

Lemma

Both types of butterflies and CCC are computationally equivalent.
Proof
CCCn is a spanning subgraph of wBFn.
The embedding is based on a one-to-one vertex mapping \varphi: V(CCCn) -> V(wBFn) defined as \varphi((i,x)) = ((i + \parity(x)) mod n, x), where \parity(x) returns 1 if x has odd parity, and 0 otherwise. Rotating all odd-parity cycles of CCCn by one position forward gets them into the same position as in wBFn.

CAPTION: CCC3 (thick lines) spans wBF3.

wBFn can be embedded into CCCn with \dil=\ecng=2.
The argument is based on the vertex symmetry of both topologies and the identity vertex mapping, with each elementary 2× 2 butterfly mapped onto CCC paths consisting of 3 edges: cycle+hypercube+cycle (see Figure). So, CCCn can simulate wBFn with slowdown 2.

CAPTION: The idea of embedding wBFn into CCCn with \dil=\ecng=2 and \load=1. Here, x'=\noni(x).

oBFn can be embedded into wBFn with \load=2 and \dil=1.
Just by merging the terminal vertices of the rows of oBFn, we get the cycles of wBFn. Hence wBFn can simulate oBFn with slowdown 2.
wBFn can be embedded into oBFn with constant dilation.
Proof omitted.
There are two other important results related to hypercubic networks which we will mention here.

Lemma

• an N-node hypercubic network can execute an N-node normal hypercube algorithm with only a constant slowdown
• an N-node butterfly can simulate any bounded-degree N-node network with slowdown O(log N) supposing some preprocessing is allowed.
We will not prove here the first statement. The second statement will be proven in Section 9 of this course. The class of normal hypercube algorithms comprises a large number of important algorithms, e.g., matrix computations, sorting, collective communication algorithms.

Definition

A hypercube algorithm is said to be normal if only one dimension of hypercube edges is used at any step of the algorithm and if consecutive dimensions are used in consecutive steps.

## Embeddings into shuffle-based topologies

Lemma

dB2,n and SEn are computationally equivalent.
Proof
SEn is a spanning subgraph of dB2,n.
The embedding is again based on a one-to-one vertex mapping \varphi: V(SEn) -> V(dB2,n) such that \varphi(u) = \sigma^{-\parity(u)}(u), where \parity is as in the Lemma above and \sigma is the shuffle operation. Hence, we simply unshuffle all necklaces containing vertices of odd parity. The Figure shows the undirected case for n=3.

CAPTION: Undirected SE3 is a spanning subgraph of undirected dB2,3.

dB2,n can be embedded into SEn with \load=1 and \dil=2.
The de Bruijn operation of left-shifting accumulates the effects of both the shuffle and exchange operations, and the argument follows just from using the identity vertex mapping.
But there are other interesting structural properties.

Lemma

dB2,n is isomorphic to SEn+1 with contracted exchange arcs.
Proof
If we splice every pair of vertices un... u10 and un... u11 in SEn+1 into vertex un... u1, we get dB2,n.

Lemma

Undirected SEn and dB2,n can simulate n-dimensional normal hypercube algorithms with almost no slowdown.
Proof
Assume a normal hypercube algorithm using consecutively dimensions J=[j1,j2,...]. Two consecutive dimension numbers ji and ji+1 differ by 1 modulo n.
SEn can simulate a t-step normal Qn algorithm in 2t-1 communication steps.
The initial mapping of hypercube nodes onto undirected SEn is such that each node un-1... u0 is mapped to uj1-1... u0 un-1... uj1. Then the first step of the normal hypercube algorithm can be directly simulated, since the nodes that need to communicate in Qn are linked by exchange edges in SEn. Assume j2=j1+1. Using the shuffle edges, we first move every hypercube node un-1... u0 from uj1-1... u0un-1... uj1 to uj1... u0un-1... uj1+1=uj2-1... u0un-1... uj2. In other words, we shuffle all necklaces. Then we can again simulate the hypercube communication along dimension j2 in one step using the exchange arcs in SEn. (If j2=j1-1, we would unshuffle the necklaces.) The simulation proceeds in this way for all subsequent steps. So one step of the hypercube algorithm is simulated by two steps of undirected SEn. If the overhead of the initial mapping is ignored, the total number of steps of the simulation is 2t-1.
dB2,n can simulate a t-step normal Qn algorithm in t communication steps and 2t computational steps.
If we contract exchange edges into vertices to get dB2,n-1 from SEn, the same vertex mapping enables the same simulation to be performed on undirected dB2,n-1 in t communication steps, since the communication along the exchange edges is internal within de Bruijn nodes. Of course, one de Bruijn node simulates two hypercube nodes, so the local processing is twice that of the hypercube.