Section#6: Embeddings and simulations of INs I+II
(CS838: Topics in parallel computing, CS1221, Tue+Thu, Feb 9+11, 1999, 8:00-9:15 a.m.)
The contents
 The embedding problem
 Embeddings into the hypercube
 Embeddings into meshes and tori
 Embedding linear arrays/cycles into any network
 Embeddings into hypercubic topologies
 Embeddings into shuffle-based topologies
The embedding problem
 The static embedding problem:

 a parallel program/algorithm is described as a set of message-passing processes,
 we know the size and structure of the process graph,
 we have a distributed memory machine with an IN of known size and topology,
 the aim is to map the process graph onto the machine so that the computation is as efficient as possible.
 The dynamic embedding problem:

 a parallel program/algorithm is described as a set of message-passing processes which dynamically spawn and terminate,
 we do not know the size and structure of the process graph; it is data-dependent, though we may know some partial information (such as an upper bound on the number of processes spawned per parent),
 we have a distributed memory machine with an IN of known size and topology,
 the aim is to map dynamically processes to processing nodes of the machine so that the computation is as efficient as possible.
Basic definitions and notions
 Embedding
 of a source graph G (with vertices V(G) and edges E(G)) into a host network H (with nodes V(H) and links E(H)) is a pair of mappings (\varphi,\psi) such that
\varphi: V(G)\to V(H)
\psi: E(G)\to P(H)
where P(H) is the set of all paths in network H. The quality of an embedding is measured by several parameters.
 load:
 the maximum number of source vertices mapped to one host node.
\load(\varphi,\psi)=max_{v\in V(H)}|\{u\in V(G);\varphi(u)=v\}|.
The load corresponds to the maximum number of processes which a host processor must execute concurrently. It is desirable to minimize the load.
 average load:
 defined similarly. An optimal average load does not necessarily mean optimal efficiency. If there is a significant gap between the maximum and average load, the load is unbalanced and lightly loaded nodes must wait for heavily loaded ones.
 uniform load:
 the load of all host nodes is the same up to one. This matters much more: it implies an optimal average and maximum load. The value of a uniform load is the reciprocal of the expansion.
 expansion:
 the ratio of the host network size (= # of processing nodes) and the source graph size (= # of processes)
\vexp(\varphi,\psi)=|V(H)|/|V(G)|.
The optimum value is 1, but this is not always achievable, since some networks are only partially extendable. The higher the expansion is, the more processing nodes idle during parallel computation. If \vexp(\varphi,\psi) < 1, then \load(\varphi,\psi) > 1.
 dilation:
 In general, embedding stretches source edges to paths in the host network. The dilation of an embedding is the maximum length of such paths taken over all source edges.
\dil(\varphi,\psi)=max_{e\in E(G)} len(\psi(e)).
If the images of edges are always shortest paths (which need not be the case), then the dilation is the maximum distance in the host network between images of adjacent source vertices. The dilation corresponds to the number of intermediate host routers along the route of a message sent between neighboring source processes. In contrast to the load, a large maximum dilation does not necessarily mean that the embedding will fail to provide good performance of the mapped algorithm, since what really matters is
 average dilation:
 The impact of both maximum and average dilation on communication performance depends on whether the routing and switching are distance-sensitive or distance-insensitive. Distance-sensitive routing requires small dilation due to latency. But distance-insensitive routing can also profit from small dilation: it consumes fewer links in the host network and therefore decreases potential communication contention or even deadlocks. This motivates another important parameter.
 link congestion:
 the maximum number of images of source edges passing through a host edge.
\ecng(\varphi,\psi)=max_{e_{2}\in E(H)}|\{e_{1}\in E(G);e_{2}\subseteq\psi(e_{1})\}|.
It represents the maximum number of interprocess communication source channels mapped onto one physical host link. Link congestion should be low to minimize contention for link or router buffers.
 node congestion:
 the maximum number of images of source edges passing through a host node.
\vcng(\varphi,\psi)=max_{u_{2}\in V(H)}|\{e_{1}\in E(G);u_{2}\in\psi(e_{1})\}|.
It represents the maximum number of interprocess channels passing through one host router and its impact depends very much on the disjointedness of these paths. If the router can perform as a switch allowing multiple link-disjoint paths concurrently, then even high node congestion does not necessarily mean high latency.
 average node or link congestion:
 average congestion per host node/link. Again, this parameter is usually more important than the maximum values.
 quasi-isometric networks
 G and H: G can be embedded into H and vice versa with O(1) embedding measures.
 emulation:
 H emulates G with slowdown h if one step of computation on G can be simulated in O(h) steps on H.
 computationally equivalent networks
 G and H: G can emulate H with constant slowdown and vice versa.
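The parameters above can be computed mechanically for any explicitly given embedding. The following Python sketch (an illustration, not part of the lecture) assumes the source and host graphs are given as vertex and edge lists, \varphi as a dict, and \psi as host paths (lists of host nodes):

```python
from collections import Counter

def embedding_metrics(G_vertices, G_edges, H_nodes, phi, psi):
    # load: max number of source vertices mapped onto one host node
    load = max(Counter(phi[v] for v in G_vertices).values())
    # expansion: |V(H)| / |V(G)|
    vexp = len(H_nodes) / len(G_vertices)
    # dilation: max length (edge count) of an image path
    dil = max(len(psi[e]) - 1 for e in G_edges)
    # link congestion: max number of image paths crossing one host link
    link_use = Counter()
    for e in G_edges:
        path = psi[e]
        for a, b in zip(path, path[1:]):
            link_use[frozenset((a, b))] += 1
    ecng = max(link_use.values())
    return load, vexp, dil, ecng
```

For example, embedding a 4-vertex cycle into a 4-node linear array with the identity vertex mapping and routing the wraparound edge along the whole array gives \load=1, \vexp=1, \dil=3, and \ecng=2.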
Remarks
 Obviously, if there is an embedding of G into H with \dil=\load=1, then G is a subgraph of H and if \vexp=1, then G is a spanning subgraph of H.
 Any two quasi-isometric networks are computationally equivalent, but not vice versa.
 In general, embeddings give useful hints on lower and upper bounds on the potential performance of algorithms mapped to parallel machines with a static IN topology. What embeddings cannot capture is the dynamic behavior of an application on a host network. An embedding with high dilation can provide a good mapping if processes incident to the most dilated source edges communicate less frequently than those incident to less dilated ones. Even high link congestion does not necessarily mean communication bottlenecks if the communication requests interleave in time.
Lower bounds
Some simple lower bounds on embedding measures follow just from graph theoretical properties of topologies.
Lemma
 The lower bound on the load is max(1,1/\vexp).
 If V(G)=V(H) and \load=1, then the lower bound on the dilation is diam(H)/diam(G).
 If the average dilation is k, then the lower bound on the link congestion is k|E(G)|/|E(H)|.
Proof
 Trivially follows from what we have said about uniform load.
 This is the so-called diameter argument. Since both graphs are of the same size, every host node is loaded. Consider two vertices u,v\in V(G) mapped to host nodes at distance diam(H). Any shortest path P(u,v) in G is stretched to length at least diam(H) in H. The dilation is minimal if all the edges of P(u,v) are dilated equally.
 The argument follows just from counting the edges of the image of G in H and from the assumption of an ideally uniform distribution of source paths over the host network.
Embeddings into the hypercube
The nice property of the hypercube is that it can simulate very efficiently almost every known IN topology. This was the main reason why the hypercube multiprocessors dominated the world of commercial massively parallel computers several years ago.
Due to the symmetry of the hypercube, an embedding of graph G into Q_{n} can be described by a vertex or edge labeling of G.
 vertex labeling:
 a labeling of the vertices of G with n-bit binary addresses. If the labels of two adjacent vertices of G differ in more than one bit, the edge joining them is dilated in Q_{n}.
 edge labeling:
 a labeling of the edges of G with dimension numbers 0,1,...,n-1.
 optimal hypercube:
 If 1<=\vexp < 2, then Q_{n} is the optimal hypercube for embedding G with unit load.
Embeddings of paths and cycles
Proposition
Let u,v\in V(Q_{n}) with Hamming distance \varrho(u,v). Let \delta(u,v) be the set of dimensions in which u and v differ. Let P(u,v) be a path of length m in Q_{n}. The edge labeling of P(u,v) is described by an n-tuple P=[p_{0},p_{1},...,p_{n-1}], where p_{j} is the number of edges of P(u,v) in dimension j. So \sum_{j=0}^{n-1}p_{j}=m. Then:
 |\delta(u,v)|=\varrho(u,v),
 each dimension in \delta(u,v) appears an odd number of times in P and each of the remaining n-\varrho(u,v) dimensions appears an even number of times in P,
 m=\varrho(u,v)\;(\mod 2),
 if u=v, then m is even (there are no cycles of odd length in hypercubes),
 if P(u,v) is a Hamiltonian path of Q_{n}, then \varrho(u,v) is odd.
 Gray sequence:
 the vertex labeling of a path in Q_{n}
 nbit Gray code:
 the vertex labeling of a hamiltonian path/cycle of Q_{n}
 binary reflected Gray code (BRGC):
 the standard Gray code, defined recursively (see Figure). Let G_{n} denote the n-bit BRGC. Then
 G_{1}={0,1},
 G_{n}={0G_{n-1},1G_{n-1}^{R}}, where
 0G_{i} and 1G_{i} denote prefixing each element of G_{i} with bit 0 and 1, respectively,
 G_{i}^{R} is sequence G_{i} in reversed order.
CAPTION: Recursive construction of the binary reflected Gray code.
 Algebraically, G_{n} can be computed as follows:
Proposition
Let b=b_{n-1}... b_{0} be an n-bit binary number. Then its encoding in G_{n} is G_{n}(b)=g_{n-1}... g_{0}, where
 g_{n-1}=b_{n-1},
 g_{i}=b_{i+1}\XOR b_{i} for i=n-2,...,0.
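Both definitions can be cross-checked with a few lines of Python. This sketch (an illustration, not from the notes) builds G_{n} by the recursive reflection rule and by the XOR formula of the Proposition:

```python
def brgc(n):
    """The n-bit binary reflected Gray code as a list of 2^n integers."""
    if n == 1:
        return [0, 1]
    prev = brgc(n - 1)
    # G_n = 0 G_{n-1} followed by 1 G_{n-1} in reversed order
    return prev + [(1 << (n - 1)) | x for x in reversed(prev)]

def gray(b):
    """Algebraic form of the Proposition: g = b XOR (b >> 1)."""
    return b ^ (b >> 1)
```

Consecutive codewords (cyclically, since the BRGC is a Hamiltonian cycle of Q_{n}) differ in exactly one bit, and both constructions agree.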
Embeddings of trees
Static trees
Despite some structural differences between hypercubes (rich in cycles and regular) and trees (acyclic and irregular), both topologies are bipartite and hierarchically recursive, which allows efficient tree embeddings into optimal hypercubes in many cases. Let us review some facts and results:
 CBT_{n} cannot be a subgraph of Q_{n+1} even though |V(CBT_{n})| < |V(Q_{n+1})| (the argument is just by counting black and white vertices in a 2-coloring). CBT_{n} is a subgraph of Q_{n+2}; the proof, by constructing such an embedding, is nontrivial (Figure(a) shows an embedding of CBT_{2} into Q_{4}).
 CBT_{n} can be embedded in Q_{n+1} with \dil=1 and \load=2 (see the cycle with edge labeling 0202 in Figure(b), node u has load 2). Then 2^{n-2}+2^{n-4}+... hypercube nodes will have load 2 and 2^{n-1}-2^{n-3}-2^{n-5}-... hypercube nodes will have load 0.
CAPTION: Embedding of CBT_{2} into a hypercube with \dil=1.
 The best embedding of CBT_{n} into Q_{n+1} is with \dil=2 and \load=\ecng=1. It is based on the following trick: we add another root to CBT_{n}. This changes the tree into a balanced (!) bipartite graph CBT'_{n} with 2^{n+1} vertices. Then it can be shown, by an elegant recursive construction depicted in Figure, that CBT'_{n} is a spanning tree of Q_{n+1}.
CAPTION: Recursive embedding of CBT_{n} into Q_{n+1} with \dil=2.
The case of n=1 is trivial (see Figure(a)). By induction, we can construct separately two identical CBT'_{n-1} in the two halves of Q_{n+1}, and by the vertex symmetry of the hypercube we can rotate and translate one subcube as shown in Figure(b). The transformation into CBT'_{n} is then straightforward (see Figure(c)). Note that all tree edges except exactly one have \dil=1. This embedding method has a nice algebraic formulation.
CAPTION: Examples of embeddings of complete binary trees into (a) Q_{3} and (b) Q_{4}.
 Any balanced one-legged caterpillar of size 2^{n} is a spanning tree of Q_{n}. A one-legged caterpillar is a binary tree with all its degree-3 vertices forming a path, called the spine, and with all the other vertices adjacent to the spine. Figure gives an example.
CAPTION: An 8-vertex balanced caterpillar as a subgraph of Q_{3}.
 There is a conjecture that any balanced binary tree with 2^{n} vertices is a spanning subgraph of Q_{n}.
 The problem of embedding an arbitrary tree into Q_{n}, where n is given, or into the optimal hypercube, with \dil=\ecng=1, is NP-complete.
 There are recursive heuristics for embedding arbitrary binary trees into optimal hypercubes. The best known algorithm \comment{\cite{heun}} achieves \load=1, \dil=8, and constant edge and node congestion.
Dynamic trees
The hypercube can efficiently simulate even dynamically growing trees (for example, branch-and-bound algorithms). The embedding algorithm has to make decisions based on some local partial knowledge of the tree. A process which spawns descendants should use some heuristic to distribute its children onto processors so that the processor load is likely to be balanced across the hypercube. The following algorithm was shown to provide satisfactory results for dynamic binary trees. It uses randomization based on flipping a coin.
 The root of the tree is placed into an arbitrary hypercube node.
 Each embedded tree vertex x
 knows the host hypercube dimension n>= 1,
 knows its hypercube labeling \varphi(x),
 knows its level number in the tree l(x),
 computes its track number t(x)=l(x)\;(\mod n),
represented as an n-bit number with a single 1 at position t(x),
 has a random generator for its flip bit \fb(x).
 If a tree vertex x, which has been a leaf embedded into hypercube node \varphi(x), becomes a parent of one or two children, it generates its flip bit and performs the following steps:
if \fb(x)=0 then
 embed the left son (if it exists) back into \varphi(x);
 embed the right son (if it exists) into \varphi(x)\XOR t(x);
else
 embed the left son (if it exists) into \varphi(x)\XOR t(x);
 embed the right son (if it exists) back into \varphi(x).
This simple approach gives surprisingly good results. For an Mvertex binary tree and an Nnode hypercube, it achieves \dil=1 and \load=O(M/N+log N) with high probability.
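The coin-flipping scheme is easy to simulate. The following Python sketch (an illustration under the rules above; the coin and the tree shape are arbitrary choices) grows a complete binary tree level by level and records the resulting processor loads:

```python
import random
from collections import Counter

def embed_dynamic_tree(n, depth, seed=0):
    """Grow a complete binary tree for `depth` spawning rounds and embed
    it on the fly into Q_n by the coin-flipping rule above.
    Returns a Counter: hypercube node -> number of tree vertices."""
    random.seed(seed)
    load = Counter({0: 1})           # root placed in an arbitrary node
    frontier = [(0, 0)]              # (phi(x), level l(x)) of each leaf
    for _ in range(depth):
        nxt = []
        for node, lvl in frontier:
            t = 1 << (lvl % n)       # track number as an n-bit mask
            if random.random() < 0.5:        # flip bit fb(x) = 0
                left, right = node, node ^ t
            else:                            # flip bit fb(x) = 1
                left, right = node ^ t, node
            load[left] += 1
            load[right] += 1
            nxt += [(left, lvl + 1), (right, lvl + 1)]
        frontier = nxt
    return load
```

By construction one child always stays on its parent's node and the other moves across one hypercube link, so every tree edge has length 0 or 1 (\dil=1).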
Simulation of binary Divide&Conquer computation on a hypercube
Many problems lead to the following binary parallel n-level Divide&Conquer (D&C) computation:
 the root process splits the problem (input data) into two halves and passes them to its two children,
 the children do recursively the same,
 at level n, the granularity of subproblems is small enough for them to be solved by leaf processes,
 results are recursively moved back to the root which combines them into the final result.
We may use a standard embedding of CBT_{n} in Q_{n+1} with \load=1 (see Subsection \ref{StaticTreestoQ}) to simulate such a computation on Q_{n+1}, but it would be inefficient, since one half of the hypercube nodes, simulating the internal vertices of CBT_{n}, would idle most of the time. However, there is a trivial method to simulate such a computation optimally on Q_{n}.
CAPTION: Simulation of 3level D&C computation on Q_{3}.
 the root node splits the problem (input data) into two halves,
 it hands over one half to its neighbor in dimension 0 and retains the other half,
 both these two nodes are active and do the same with their halves using dimension 1,
 this is repeated for dimensions i=2,...,n,
 all the nodes of Q_{n} become leaves of CBT_{n} and compute the leaf subproblems,
 results are folded back in the reverse order.
This corresponds to an embedding of CBT_{n} into Q_{n} as shown in Figure(a). Left children are mapped onto the same hypercube node as their parents. The embedded tree forms the so-called binomial spanning tree of the hypercube (see Figure(b)), which we will see again in Section 10 of this course.
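The unfolding/folding scheme can be sketched sequentially as follows (an illustration, not from the notes; `leaf_solve` and `combine` are hypothetical problem-specific callbacks, and the data size is assumed divisible by 2^{n}):

```python
def dandc_on_hypercube(n, data, leaf_solve, combine):
    """Sequentially simulate the n-level D&C scheme on Q_n.
    Node 0 starts with the whole problem; in step `dim` every active
    node keeps one half and hands the other to its neighbor across
    dimension `dim`. len(data) is assumed divisible by 2**n."""
    slot = {0: data}                          # node -> current subproblem
    for dim in range(n):                      # unfolding phase
        for node in list(slot):
            half = len(slot[node]) // 2
            slot[node ^ (1 << dim)] = slot[node][half:]
            slot[node] = slot[node][:half]
    results = {node: leaf_solve(chunk) for node, chunk in slot.items()}
    for dim in reversed(range(n)):            # folding phase
        for node in list(results):
            if node & (1 << dim):
                partner = node ^ (1 << dim)
                results[partner] = combine(results[partner], results.pop(node))
    return results[0]
```

The set of sender/receiver pairs traced by the two phases is exactly the binomial spanning tree of Q_{n} mentioned above.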
Embeddings of meshes-of-trees
Even though it is known that MT_{n} is even a subgraph of Q_{2n+2} (which is the optimal hypercube), we will only explain here an easy way to embed MT_{n} into Q_{2n+2} using the results above. We just use these three facts:
 Q_{2n+2}=Q_{n+1} × Q_{n+1},
 MT_{n}\subset CBT_{n}× CBT_{n},
 CBT_{n} can be embedded into Q_{n+1} with \dil=2 and \load=\ecng=1 (see Subsection \ref{StaticTreestoQ}).
This gives immediately an embedding of MT_{n} into Q_{2n+2} with \dil=2 and \load=\ecng=1:
 embed CBT_{n} into Q_{n+1} with \dil=2 and \load=\ecng=1,
 apply cartesian product to get an embedding of CBT_{n}× CBT_{n} into Q_{2n+2}
with \dil=2 and \load=\ecng=1,
 convert CBT_{n}× CBT_{n} into MT_{n} by removing superfluous edges.
CAPTION: MT_{2} embedded in Q_{6} with \load=\ecng=1 and \dil=2.
Embeddings of meshes
Again, the basic approach takes advantage of the fact that the hypercube and the meshes are defined recursively as cartesian products of lower-dimensional hypercubes and meshes, respectively.
Embeddings of hypercubic networks
Embeddings of other graphs
 Given a graph G and integers k and n, the problem of embedding G into Q_{n} with dilation k is known to be NP-complete.
 Embeddings of de Bruijn graphs with constant dilation and expansion into the hypercube are not known.
Embeddings into meshes and tori
2D or 3D meshes/tori are currently the most commonly used IN topologies; therefore, methods for embedding other networks into them, or for simulating other networks on them, are of high practical interest.
Lemma
Given a graph G and integers k and n, the problem of embedding G into an n-dimensional mesh with \dil<= k is NP-complete\comment{\cite{bhattcosmakis}}. This holds even for k=1, n=2, and G a binary tree.
On the other hand, problems of embeddings into 2D meshes are very close to problems of designing VLSI layouts and many solutions/approaches are known\comment{\cite{ullman}}.
Embeddings among meshes and tori
Lemma
Meshes and tori are computationally equivalent, since they are quasi-isometric.
Proof
Given integers z_{1},...,z_{n}, let M=M(z_{1},...,z_{n}) and T=T(z_{1},...,z_{n}).
 M is a subgraph of T so that T can simulate M with no slowdown.
 On the other hand, T can be embedded into M with \load=1 and \dil=\ecng=2.
 Decompose M into cartesian product M(z_{1})× ...× M(z_{n}) and T into T(z_{1})× ...× T(z_{n}).
 Embed each cycle T(z_{i}) into M(z_{i}) with \load=1 and \dil=\ecng=2 as shown in Figure(a) for z_{1}=5.
 The embedding of T into M is composed by using the cartesian product. Figure(b) shows the 2-dimensional case.
CAPTION: Optimal embedding of tori into meshes with \load=1 and \dil=\ecng=2.
If the mesh and torus do not have the same dimensionality or side lengths, we first embed tori into tori or meshes into meshes and then apply this solution.
Embeddings of tori into tori
In the case of general higher-dimensional rectangular and cube tori, the problem of optimal embeddings is hard and open. Optimal solutions are known only for special low-dimensional cases. We will mention one such result.
Lemma
Let z_{1}\not=z_{2}. Then the 2D torus T(z_{1},z_{2}) can be embedded into the cycle T(z_{1}z_{2}) with \load=1 and with optimal \dil=\ecng=min(z_{1},z_{2}).
Proof
The embedding is simple and straightforward: it is based on a one-to-one vertex mapping of T(z_{1},z_{2}) which is either row-major if z_{1}>= z_{2} or column-major otherwise. See Figure for an example.
CAPTION: Optimal embedding of 2D torus T(3,4) into cycle T(12) with \dil=3=min(3,4) based on column-major vertex mapping. Horizontal edges of T(3,4) have \dil=3 and vertical wraparound edges have \dil=2.
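The claimed dilation can be checked exhaustively. The sketch below (not from the notes) maps the torus onto the cycle with the shorter side varying fastest and returns the maximum dilation over all torus edges:

```python
def torus_to_cycle_dilation(z1, z2):
    """Map T(z1, z2) one-to-one onto the cycle T(z1*z2), letting the
    shorter side vary fastest (row-major if z1 >= z2, column-major
    otherwise), and return the maximum dilation over all torus edges."""
    N = z1 * z2
    if z1 >= z2:
        pos = lambda i, j: i * z2 + j        # row-major
    else:
        pos = lambda i, j: j * z1 + i        # column-major
    def cyc_dist(a, b):
        d = abs(a - b)
        return min(d, N - d)                 # distance in the cycle
    dil = 0
    for i in range(z1):
        for j in range(z2):
            for a, b in (((i + 1) % z1, j), (i, (j + 1) % z2)):
                dil = max(dil, cyc_dist(pos(i, j), pos(a, b)))
    return dil
```

For the T(3,4) example of the Figure this returns 3 = min(3,4), matching the Lemma.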
Embeddings of meshes into meshes
Here we will explain some methods by which meshes can simulate each other if they are of the same size but differ in the number of dimensions and/or the lengths of sides. It is again a difficult problem in general; we will discuss only several special cases for which solutions are known. A mesh with nonequal side lengths is called rectangular here. The notation for cube meshes will indicate the dimensionality with a subscript at M. For example, M_{d}(w,...,w) denotes a d-dimensional cube mesh.
Embeddings of cube meshes into rectangular meshes
Theorem
Let n>= 2 and let z_{1}>= ...>= z_{n}>= 2 be different integers. Let N=\Pi_{i=1}^{n} z_{i}, w=N^{1/n}, R=M(z_{1},...,z_{n}), and S=M_{n}(w,...,w). Without loss of generality, assume that w is an integer. Then
R can simulate S with slowdown max_{i} z_{i}/w.
Proof
Inductive, by showing that there exists an embedding of S into R with \dil=\ecng=O((max_{i} z_{i})/w) and constant load.
 (i) Induction basis:
 Let n=2 and z_{1} > z_{2}. Then S=M(w,w) can be embedded into R=M(z_{1},z_{2}) with \dil=\lceil z_{1}/w\rceil, \load=2, and \ecng=O(\lceil z_{1}/w\rceil). We show here one possible solution, but there are others with similar properties. Columns of S are placed successively into R left-to-right along the first dimension, beside each other without overlapping. Each of them starts in the first row of R and snakes vertically through R rightward. See Figure. Since some nodes of R are wasted, the last column of R is reached before all columns of S are embedded, but in that case the snaking simply bounces mirror-like back. Some nodes of R will have \load=2 and some \load=0. The width of one snake in R is \lceil w/z_{2}\rceil=\lceil z_{1}/w\rceil=\lceil max(z_{1},z_{2})/w\rceil; this is the dilation of horizontal edges of S, and the link congestion of horizontal links in R is 1+z_{1}/w.
CAPTION: Embedding of a 2D square mesh into a 2D rectangular mesh.
 (ii) Auxiliary lemma:
Lemma
M(z_{1},...,z_{n-2},z_{n-1}z_{n}) is a subgraph of M(z_{1},...,z_{n-1},z_{n}).
Proof
We use again a simple snake-like embedding. M(z_{n-1}z_{n}) can be snaked through M(z_{n-1},z_{n}) with unit measures as shown in Figure, and the general case follows from the cartesian-product property.
CAPTION: Snaking M(z_{n-1}z_{n}) through 2-dimensional M(z_{n-1},z_{n}).
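The snake itself is a one-line index computation. Here is a Python sketch (an illustration, not from the notes) of the boustrophedon mapping for the 2D case:

```python
def snake(k, z1, z2):
    """Position of vertex k of the path M(z1*z2) when snaked row by row
    (boustrophedon) through the 2D mesh M(z1, z2)."""
    row, col = divmod(k, z2)
    if row % 2 == 1:
        col = z2 - 1 - col       # odd-numbered rows run right-to-left
    return row, col
```

Consecutive path vertices always land on mesh neighbors, i.e., all measures are 1, which is exactly the subgraph property claimed by the Lemma.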
 (iii) Induction step:
 Decompose R into w submeshes R_{i}=M(\lceil z_{1}/w\rceil,z_{2},...,z_{n}), i=1,...,w, and S into w submeshes S_{i}=M_{n}(1,w,...,w). The embedding of S into R is composed of one-to-one embeddings of S_{i} into R_{i} along the direction of z_{1} in R (see Figure). Since the width of R_{i} along the first dimension is \lceil z_{1}/w\rceil, the edges in the first dimension of S have dilation \lceil z_{1}/w\rceil. Individual embeddings of S_{i} into R_{i} for i=1,...,w are by induction. Since each S_{i} is isomorphic to an (n-1)-dimensional cube mesh M_{n-1}(w,...,w), we can embed it by induction into the (n-1)-dimensional mesh M(\lceil z_{1}z_{n}/w\rceil,z_{2},...,z_{n-1}) with dilation
\lceil max(\lceil z_{1}z_{n}/w\rceil,z_{2})/w\rceil<=\lceil z_{1}/w\rceil,
\ecng=O(z_{1}/w), and \load=2. By the Lemma, the (n-1)-dimensional mesh M(\lceil z_{1}z_{n}/w\rceil,z_{2},...,z_{n-1}) is a subgraph of the n-dimensional mesh R_{i}=M(\lceil z_{1}/w\rceil,z_{2},...,z_{n}). It follows that each S_{i} can be embedded into R_{i} with \dil=\lceil z_{1}/w\rceil, \ecng=O(z_{1}/w), and \load=2.
CAPTION: Recursive embedding of ndimensional S into R.
Corollary
S can be simulated by the (n-k)-dimensional mesh M_{n-k}(N^{1/(n-k)},...,N^{1/(n-k)}) with slowdown N^{k/(n(n-k))}.
Proof
By the Theorem, S can be embedded into R=M_{n}(N^{1/(n-k)},...,N^{1/(n-k)},1,...,1) with dilation
N^{1/(n-k)}/N^{1/n}=N^{k/(n(n-k))}
and R is isomorphic to the (n-k)-dimensional M_{n-k}(N^{1/(n-k)},...,N^{1/(n-k)}).
Embeddings of rectangular meshes into cube meshes
Theorem
Let R and S be meshes defined as in the previous Theorem. Then
R can be embedded into S with constant dilation.
Proof
 (i) Induction basis:
 Let n=2. Then R=M(z_{1},z_{2}) can be embedded into S=M(w,w) with \dil=1 and \load=\ecng=2 by a trivial snaking combined with perpendicular overlap bending at the vertical boundaries of S, as shown in Figure. Note that the nodes of S close to its vertical boundaries have \load=0 or \load=2.
CAPTION: Embedding of M(z_{1},z_{2}) into M(w,w) with \dil=1.
 (ii) Induction step:
 Let R'=M(z_{2},...,z_{n}), S'=M_{n-1}(w,...,w), r=|V(R')|, and s=|V(S')|. Since z_{1} > w, it follows that r=N/z_{1} < s=N/w. Thus s/r=z_{1}/w > 1. By definition, R=M(z_{1})× R' and S=M(w)× S'. We split the (n-1)-dimensional mesh S' into \lceil s/r\rceil cube submeshes S_{i}'=M_{n-1}(r^{1/(n-1)},...,r^{1/(n-1)}) (Figure(c)). R can be split along the first dimension into z_{1} (n-1)-dimensional submeshes R_{i}, isomorphic with R', i.e., of size r each (Figure(a)).
CAPTION: Embedding of an n-dimensional rectangular mesh into an n-dimensional cube mesh with \dil=1.
By induction, R_{1} can be embedded into S_{1}' with unit dilation. The rest of mesh R is reshaped in the same way, so that R can be viewed as a mesh M_{n}(z_{1},r^{1/(n-1)},...,r^{1/(n-1)}) (Figure(b)). This folded mesh is then snaked through S in the way depicted in Figure(c).
Embeddings among 2D rectangular meshes
A practically important problem is the embedding of one 2D mesh into another of approximately the same size with \load=1. There exist some partial solutions\comment{\cite{ellis}}. We will not describe the methods here due to lack of space. They are usually based on small primitive tiles from which embeddings of larger meshes are constructed by composing, squeezing, and folding.
Lemma
Consider M(a,b) and M(a',b') such that a' < a<= b and b' is the minimum integer satisfying a'b'>= ab.
 If a/a'<= 2, then M(a,b) can be embedded into M(a',b') with \dil<=2.
 If a/a'<= 3, then M(a,b) can be embedded into M(a',b') with \dil<=3.
 These results are optimal, since the lower bounds on dilation and congestion are 2.
CAPTION: Embedding of M(2,9) into M(3,6) with \dil=2 and \ecng=3.
Simulation of parallel binary D&C computation on 2D meshes
Lemma
CBT_{2m} can be embedded into M(2^{m},2^{m}) with bidirectional links so that
 \load=2: each mesh node is loaded with exactly one leaf vertex and (except one) exactly one internal vertex,
 distinct tree edges are mapped onto arc-disjoint links in the mesh,
 \dil=2^{m}+O(m).
Proof
The embedding is again nicely recursive. The basic idea follows from Figure. The induction base is in parts (a)-(c). Part (c) shows CBT_{4}, which is embedded in M(2^{2},2^{2}) using either pattern A or B. Note that in each pattern one node is loaded with only one leaf and can therefore become a new root when building larger trees from these building blocks. Also, there are some free arcs which can be used for joining subtrees to new roots. Part (d) illustrates the induction step.
CAPTION: Simulation of D&C computation on a square mesh.
Embeddings of linear arrays and cycles into highdimensional meshes and tori
 Hamiltonian paths/cycles:
 We have seen in Section 5 that
 any T(z_{1},...,z_{n}) has a hamiltonian cycle,
 any M(z_{1},...,z_{n}) has a hamiltonian path,
 M(z_{1},...,z_{n}) has a hamiltonian cycle iff at least one z_{i} is even.
 Recursive patterns:
 Cube meshes M_{n}(k,...,k), where k=2^{i} for some integer i, have hamiltonian paths/cycles with a nicely recursive structure. Let us demonstrate, for the 2D mesh M_{i}=M(2^{i},2^{i}), one solution which is suitable, for example, for mesh-of-trees or pyramidal meshes.
 Induction base:
 M_{1} is trivial (see Figure(a)).
 Induction step:

 split M_{i} into four quarters M_{i-1},
 construct hamiltonian path recursively in each of them, so that the upper two patterns are rotated by 90 degrees as shown in Figure(b),
 join the terminal vertices of these paths as shown in the same figure.
 Peano curves:
 If large dilation is not an issue (for example, in the case of distance-insensitive routing), meshes can be traversed in a fractal-like way. Very useful are the so-called Peano curves (see Figure(c)). They have dilation \dil=\Theta(\sqrt{N}), but their fractal structure does not require rotation, just translation.
CAPTION: Recursive numbering schemes for square 2D meshes.
Linear arrays and cycles, as the most trivial topologies, can be embedded with constant measures into any network; this is a classical result by Sekanina\comment{\cite{sekanina}}.
Lemma
An N-vertex cycle C can be embedded into any N-node network G with \load=1, \dil<= 3, and \ecng=2.
Proof
Constructive
 construct a spanning tree T_{G} of G,
 partition its nodes into two classes: V_{0} = nodes at even-numbered levels of T_{G} and V_{1} = nodes at odd-numbered levels of T_{G},
 embed C in G by traversing T_{G} in depth-first order and placing the next vertex of C into the currently visited node x of T_{G} whenever
 x is in V_{1} and this is our first visit of x, or
 x is in V_{0} and this is our last visit of x.
Each node of T_{G} will be loaded exactly once. The proof that \dil<= 3 follows from the recursive structure of any spanning tree.
CAPTION: Embedding an N-vertex cycle into an N-node network G with \dil<= 3 and \load=1.
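The construction can be written as a short recursive procedure. In the following Python sketch (an illustration of the rule above, not from the notes) the DFS tree of the input graph serves as the spanning tree T_{G}:

```python
def cycle_embedding(adj, root=0):
    """Embed an N-vertex cycle into a connected graph given as an
    adjacency dict: DFS builds the spanning tree T_G, and the next
    cycle vertex is placed into node x on the first visit if x is at
    an odd level (V_1) and on the last visit if at an even level (V_0)."""
    order, seen = [], {root}
    def dfs(x, level):
        if level % 2 == 1:
            order.append(x)              # V_1: place on first visit
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                dfs(y, level + 1)
        if level % 2 == 0:
            order.append(x)              # V_0: place on last visit
    dfs(root, 0)
    return order
```

Running this on any small tree and measuring distances between consecutive placed nodes confirms \load=1 and \dil<= 3.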
Embeddings into hypercubic topologies
Lemma
Both types of butterflies and CCC are computationally equivalent.
Proof
 CCC_{n} is a spanning subgraph of wBF_{n}.

The embedding is based on a one-to-one vertex mapping \varphi:V(CCC_{n})\to V(wBF_{n}) defined as \varphi((i,x))=(i+_{n} \parity(x),x), where \parity(x) returns 1 if x has odd parity, and 0 otherwise. Rotating all cycles of CCC_{n} with odd-parity x by one position forward gets them into the same position as in wBF_{n}.
CAPTION: CCC_{3} (thick lines) spans wBF_{3}.
 wBF_{n} can be embedded into CCC_{n} with \dil=\ecng=2.
 The argument is based on the vertex symmetry of both topologies and the identity vertex mapping, which maps each elementary 2× 2 butterfly onto CCC paths consisting of 3 edges: cycle+hypercube+cycle (see Figure). So CCC_{n} can simulate wBF_{n} with slowdown 2.
CAPTION: The idea of embedding wBF_{n} into CCC_{n} with \dil=\ecng=2 and \load=1. Here, x'=\non_{i}(x).
 oBF_{n} can be embedded into wBF_{n} with \load=2 and \dil=1.
 Just by merging the terminal vertices of the rows of oBF_{n}, we get the cycles of wBF_{n}. Hence wBF_{n} can simulate oBF_{n} with slowdown 2.
 wBF_{n} can be embedded into oBF_{n} with constant dilation.
 Proof omitted.
There are two other important results related to hypercubic networks which we will mention here.
Lemma
 an N-node hypercubic network can execute an N-node normal hypercube algorithm with only a constant slowdown,
 an N-node butterfly can simulate any bounded-degree N-node network with slowdown O(log N), supposing some preprocessing is allowed.
We will not prove here the first statement. The second statement will be proven in Section 9 of this course.
The class of normal hypercube algorithms comprises a large number of important algorithms, e.g., matrix computations, sorting, collective communication algorithms.
Definition
A hypercube algorithm is said to be normal if only one dimension of hypercube edges is used at any step of the algorithm and if consecutive dimensions are used in consecutive steps.
Embeddings into shufflebased topologies
Lemma
dB_{2,n} and SE_{n} are computationally equivalent.
Proof
 SE_{n} is a spanning subgraph of dB_{2,n}.

The embedding is again based on a one-to-one vertex mapping \varphi:V(SE_{n})\to V(dB_{2,n}) such that \varphi(u)=\sigma^{\parity(u)}(u), where \parity is as in the previous Lemma and \sigma is the shuffle operation. Hence, we simply unshuffle all necklaces with vertices of odd parity. Figure shows the undirected case for n=3.
CAPTION: Undirected SE_{3} is a spanning subgraph of undirected dB_{2,3}.
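The mapping \varphi can be verified exhaustively for small n. In the sketch below (not from the notes) \sigma is taken as rotation toward the low-order bit, i.e., the unshuffle direction mentioned above; the opposite convention yields the same undirected graph:

```python
def se_spans_debruijn(n):
    """Check that phi(u) = sigma^parity(u)(u) is a bijection mapping
    every edge of the undirected SE_n onto an edge of dB_{2,n}."""
    mask = (1 << n) - 1
    sigma = lambda x: (x >> 1) | ((x & 1) << (n - 1))  # rotate toward bit 0
    parity = lambda x: bin(x).count("1") % 2
    phi = lambda u: sigma(u) if parity(u) else u
    # undirected de Bruijn edges: x -> (2x + b) mod 2^n, b in {0, 1}
    db = set()
    for x in range(1 << n):
        for b in (0, 1):
            db.add(frozenset((x, ((x << 1) | b) & mask)))
    ok = len({phi(u) for u in range(1 << n)}) == (1 << n)   # bijection
    for u in range(1 << n):
        for v in (sigma(u), u ^ 1):     # shuffle and exchange neighbors
            ok = ok and frozenset((phi(u), phi(v))) in db
    return ok
```

Since \varphi is a bijection and every SE_{n} edge lands on a dB_{2,n} edge, SE_{n} is indeed a spanning subgraph of dB_{2,n}.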
 dB_{2,n} can be embedded into SE_{n} with \load=1 and \dil=2.

The de Bruijn operation of left-shifting accumulates the effects of both the shuffle and exchange operations, and the argument follows just from using the identity vertex mapping.
But there are other interesting structural properties.
Lemma
dB_{2,n} is isomorphic to SE_{n+1} with contracted exchange arcs.
Proof
If we merge every pair of vertices u_{n}... u_{1}0 and u_{n}... u_{1}1 in SE_{n+1} into a single vertex u_{n}... u_{1}, we get dB_{2,n}.
Lemma
Undirected SE_{n} and dB_{2,n} can simulate n-dimensional normal hypercube algorithms with almost no slowdown.
Proof
Assume a normal hypercube algorithm using consecutively the dimensions J=[j_{1},j_{2},...]. Two consecutive dimension numbers j_{i} and j_{i+1} differ by 1 modulo n.
 SE_{n} can simulate a t-step normal Q_{n} algorithm in 2t-1 communication steps.

The initial mapping of hypercube nodes onto undirected SE_{n} is such that each node u_{n-1}... u_{0} is mapped to u_{j_{1}-1}... u_{0}u_{n-1}... u_{j_{1}}. Then the first step of the normal hypercube algorithm can be directly simulated, since the nodes that need to communicate in Q_{n} are linked by exchange edges in SE_{n}. Assume j_{2}=j_{1}+1. Using the shuffle edges, we first move every hypercube node u_{n-1}... u_{0} from u_{j_{1}-1}... u_{0}u_{n-1}... u_{j_{1}} to u_{j_{1}}... u_{0}u_{n-1}... u_{j_{1}+1}=u_{j_{2}-1}... u_{0}u_{n-1}... u_{j_{2}}. In other words, we shuffle all necklaces. Then we can again simulate the hypercube communication along dimension j_{2} in one step using the exchange edges in SE_{n}. (If j_{2}=j_{1}-1, we would unshuffle the necklaces.) The simulation proceeds in this way for all subsequent steps. So one step of the hypercube algorithm is simulated by two steps of undirected SE_{n}. If the overhead of the initial mapping is ignored, the total number of steps of the simulation is 2t-1.
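The two invariants used in this argument can be checked exhaustively for small n. The sketch below (an illustration, not from the notes) verifies that with hypercube node u stored at the SE_{n} vertex obtained by rotating u so that bit 0 of the vertex is u_{j}, dimension-j partners sit across exchange edges, and the remapping between consecutive dimensions is a single rotation of every necklace:

```python
def check_normal_simulation(n):
    """Verify the mapping behind running a normal Q_n algorithm on SE_n:
    rot(u, j) places u so that the low-order vertex bit is u_j."""
    mask = (1 << n) - 1
    rot = lambda u, j: ((u >> j) | (u << (n - j))) & mask
    unshuffle = lambda x: (x >> 1) | ((x & 1) << (n - 1))
    for j in range(n - 1):
        for u in range(1 << n):
            # partners across dimension j differ exactly in the exchange bit
            assert rot(u, j) ^ rot(u ^ (1 << j), j) == 1
            # remapping to dimension j+1 = one rotation (a shuffle edge)
            assert rot(u, j + 1) == unshuffle(rot(u, j))
    return True
```

Whether this rotation is called a shuffle or an unshuffle depends on the direction convention for \sigma; in the undirected SE_{n} both traverse the same edge.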
 dB_{2,n} can simulate a t-step normal Q_{n} algorithm in t communication steps and 2t computational steps.
 If we contract the exchange edges into vertices to get dB_{2,n-1} from SE_{n}, the same vertex mapping enables the same simulation to be performed on undirected dB_{2,n-1} in t communication steps, since the communication along the exchange edges is internal within de Bruijn nodes. Of course, one de Bruijn node simulates two hypercube nodes, so the local processing is twice that in the hypercube.