In this paper, we address the question of client clustering by presenting a new method that generates a hierarchy of client clusters. As opposed to recent work on IP client clustering described in [15], our method uses autonomous systems (AS's) as the basic cluster unit. We argue that clustering at the IP level results in cluster units which are too detailed, too numerous and do not readily lend themselves to higher levels of aggregation. In contrast, clustering at the AS level provides a natural means for not only identifying clients which should experience similar performance from a given cache but also for aggregating AS's into larger groups which should experience similar performance.
Our interest in the ability to aggregate AS's stems from the desire
to clearly understand demand and to effectively distribute caches.
Our clustering method enables groups of AS's to be coalesced into larger
groups based on best path connectivity extracted from BGP routing tables.
Specifically, our method considers the Hamming distance between AS's and
their peers and coalesces AS's with minimal Hamming distance.
For example, if
has peers
and
, and
has peers
and
,
and
, then
could be coalesced into
and
would
become the exemplar for that group.
Clearly, there is no strict
hierarchy in the AS structure of the Internet so our algorithm works by
successively relaxing the Hamming distance requirements for clustering.
The result is that our tree structure does not reflect precise connectivity
characteristics between AS's. However the average number of AS hops to what we
classify as backbone AS's is 1.61 while the average tree depth in our graph is
1.96 thus, our approximation of a tree does not increase AS hop distance
significantly.
The nature of our clustering algorithm is such that if we repeatedly apply it to a BGP data set, we would eventually collapse the entire network to a tree with a single root node. Our intention, however, is to only collapse the topology to a size which readily enables evaluation of demand and facilitates our cache placement algorithms. The result of our clustering algorithm presented in this paper is a set of 21 root AS trees. These root AS's consist of many of the major ISPs such as BBNPlanet and AT&T, but also some smaller ISPs such as LINX due to the nature of the algorithm. The root AS's connect on average with 7.29 other root AS's indicating a high level of connectivity between these nodes. The average out-degree of the root AS's (i.e.. the number of AS with whom they peer) is 198 with a median of 97 indicating that the root AS's facilitate Internet access to a large number of other AS's. These characteristics indicate that while a tree is an idealization of the actual AS topology, it does not abstract away detail essential in applying this structure in a number of areas.
One domain in which our tree hierarchy of AS's naturally lends itself is
cache placement. Since our tree generation algorithm is based on
best path information from BGP tables, it enables caches to be placed on
AS hop paths which
would actually be used in the Internet. Placement of caches in trees has
been treated as a dynamic programming problem in [17] however the
means by which trees were created was not treated in that work.
We address the issue of optimal cache placement by describing a dynamic programming
algorithm in which each subtree calculates the optimal use for 0 to caches
in its subtree. Each parent node can then discover the maximum benefit from
caches by distributing all of the caches among its children or by
retaining one cache for itself.
We evaluate the effectiveness of our optimal cache placement algorithm by comparing it to the total cost of traffic when 0 to 50 caches are placed using a greedy algorithm and a random placement. We find that optimal placement of a small number of caches does measurably better than the other techniques, but that greedy placement performs surprisingly close to optimal when more caches are deployed.
The remainder of this paper is organized as follows: Section 2 discusses related work; Section 3 describes our process for constructing client clusters using BGP routing data; and Section 4 describes the results of evaluating client demand from a Web log using our clustering results. In Section 5 we present our algorithms for optimally placing caches based on client demand distribution, and in Section 6 we demonstrate the effectiveness of our cache placement methods. In Section 7, we summarize our results and conclude with directions for future study.