
Multiple passes across the graph

Clustering is performed in successive passes through the graph, building up large clusters by visiting small clusters and merging each one into an existing larger cluster.

For each clustering pass, each node $n$ without a parent (i.e., $p(n) = 0$) tries to find a suitable parent. Conceptually, the candidate parents are the nodes that dominate it, $C_n = \left\{ m \in N_n \vert dom(m,n) \right\}$. In practice this requirement is too strict, and we define $C_n$ more precisely below. Among the candidate parents $m \in C_n$, the best parent is the nearest one:

\begin{displaymath}nearest(n) = \mathop{\arg\min}_{m \in C_n} dist(n,m) \end{displaymath}

If $C_n \not= \emptyset$, node $n$ is merged into the cluster of the best parent $m$: $p(n)$ is set to $m$ and $n$ is removed from $N_m$. Note that $n$ is not removed from any other neighbor list, since $n$ might later be chosen as a parent by an even smaller cluster.
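For concreteness, the sketch below shows one clustering pass in Python. It assumes the neighbor lists $N$ are stored as sets, parents are kept in a map $p$ (with 0 meaning ``no parent yet''), and that $dist$ and the candidate-parent computation (sketched further below, using the tolerance $\delta$ introduced later in this section) are supplied; the names and data layout are illustrative assumptions, not taken from the actual implementation.

\begin{verbatim}
# A minimal sketch of one clustering pass (illustrative names only).
# N: dict mapping node -> set of neighbors
# p: dict mapping node -> parent (0 means "no parent yet")
# dist(n, m): distance between nodes n and m (assumed given)
# candidate_parents(n, N, delta): the set C_n (sketched further below)

def clustering_pass(nodes, N, p, dist, delta):
    for n in nodes:
        if p[n] != 0:                 # node already merged into a cluster
            continue
        C_n = candidate_parents(n, N, delta)
        if not C_n:                   # no suitable parent on this pass
            continue
        best = min(C_n, key=lambda m: dist(n, m))   # nearest candidate parent
        p[n] = best                   # merge n into best's cluster
        N[best].discard(n)            # remove n from the parent's neighbor list
        # n stays in all other neighbor lists, so an even smaller
        # cluster may still pick n as its parent on a later pass.
\end{verbatim}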

An interesting design decision arises when $N_m = N_n$: neither neighbor list is a proper superset of the other, so neither node dominates. We defined domination in this way so that both nodes remain free to become siblings under some other parent, keeping the tree comparatively shallow. If $n$ or $m$ had been arbitrarily chosen as the parent, the other node (and its subtree) would appear to be one hop farther from the backbone.

It would also have been meaningful to define the best parent as the farthest candidate parent. This would cause AS's to choose AS's with very high out-degree as their preferred parent. The result would have been a shallower tree that more closely matches the distance to the backbone, but it also would have lost the useful categorization of AS's into clusters with very similar sets of neighbors.

In practice, many AS's connect to more than one major provider, so they are not strictly dominated by any single node they link to. To relax the domination requirement, we introduce a tolerance that grows with each pass through the unassigned nodes. The tolerance, $\delta$, allows a node to become a child of any neighbor with a higher out-degree whose overhang is at most the current tolerance. So the actual computation of the set of candidate parents is:

\begin{displaymath}C_n = \left\{ m \in N_n \vert overhang(n,m) \leq \delta \wedge outdegree(m) > outdegree(n) \right\} \end{displaymath}
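A possible realization of this set is sketched below. Because the precise definitions of $overhang$ and $outdegree$ are given elsewhere in the paper, this sketch assumes $overhang(n,m)$ counts the neighbors of $n$ that $m$ does not cover and takes $outdegree$ to be the size of a node's neighbor list; both are assumptions made for illustration.

\begin{verbatim}
# Hedged sketch of the candidate-parent set C_n (assumed definitions):
#   outdegree(v)   -- taken here as the size of v's neighbor list
#   overhang(n, m) -- assumed to count neighbors of n not covered by m

def candidate_parents(n, N, delta):
    def outdegree(v):
        return len(N[v])

    def overhang(a, b):
        # neighbors of a that b neither equals nor neighbors
        return len(N[a] - N[b] - {b})

    return {m for m in N[n]
            if overhang(n, m) <= delta and outdegree(m) > outdegree(n)}
\end{verbatim}

Each pass would call this with a larger $\delta$, so a multiply-homed AS that no single provider strictly dominates eventually finds an acceptable parent.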

