Next: Related Work Up: Cache Placement Methods Based Previous: Cache Placement Methods Based

Introduction

Content delivery providers (CDNs) have distributed networks of caches in the Internet as a means for reducing load on Web servers, reducing network load for Internet Service Providers (ISPs) and improving performance for clients. In order to effectively deploy and manage cache and network resources, CDNs must be able to accurately identify areas of client demand. One means for doing this is by clustering clients that are topologically close to each other, and then placing caches in the areas where demand is typically large. This raises two immediate questions: how can clusters of clients be generated and once identified, how can caches be placed among the clusters so as to maximize their impact?

In this paper, we address the question of client clustering by presenting a new method that generates a hierarchy of client clusters. As opposed to recent work on IP client clustering described in [15], our method uses autonomous systems (AS's) as the basic cluster unit. We argue that clustering at the IP level results in cluster units which are too detailed, too numerous and do not readily lend themselves to higher levels of aggregation. In contrast, clustering at the AS level provides a natural means for not only identifying clients which should experience similar performance from a given cache but also for aggregating AS's into larger groups which should experience similar performance.

Our interest in the ability to aggregate AS's stems from the desire to clearly understand demand and to effectively distribute caches. Our clustering method enables groups of AS's to be coalesced into larger groups based on best path connectivity extracted from BGP routing tables. Specifically, our method considers the Hamming distance between AS's and their peers and coalesces AS's with minimal Hamming distance. For example, if has peers and , and has peers and , and , then could be coalesced into and would become the exemplar for that group. Clearly, there is no strict hierarchy in the AS structure of the Internet so our algorithm works by successively relaxing the Hamming distance requirements for clustering. The result is that our tree structure does not reflect precise connectivity characteristics between AS's. However the average number of AS hops to what we classify as backbone AS's is 1.61 while the average tree depth in our graph is 1.96 thus, our approximation of a tree does not increase AS hop distance significantly.

The nature of our clustering algorithm is such that if we repeatedly apply it to a BGP data set, we would eventually collapse the entire network to a tree with a single root node. Our intention, however, is to only collapse the topology to a size which readily enables evaluation of demand and facilitates our cache placement algorithms. The result of our clustering algorithm presented in this paper is a set of 21 root AS trees. These root AS's consist of many of the major ISPs such as BBNPlanet and AT&T, but also some smaller ISPs such as LINX due to the nature of the algorithm. The root AS's connect on average with 7.29 other root AS's indicating a high level of connectivity between these nodes. The average out-degree of the root AS's (i.e.. the number of AS with whom they peer) is 198 with a median of 97 indicating that the root AS's facilitate Internet access to a large number of other AS's. These characteristics indicate that while a tree is an idealization of the actual AS topology, it does not abstract away detail essential in applying this structure in a number of areas.

One domain in which our tree hierarchy of AS's naturally lends itself is cache placement. Since our tree generation algorithm is based on best path information from BGP tables, it enables caches to be placed on AS hop paths which would actually be used in the Internet. Placement of caches in trees has been treated as a dynamic programming problem in [17] however the means by which trees were created was not treated in that work. We address the issue of optimal cache placement by describing a dynamic programming algorithm in which each subtree calculates the optimal use for 0 to $\ell$ caches in its subtree. Each parent node can then discover the maximum benefit from $\ell$ caches by distributing all of the caches among its children or by retaining one cache for itself.

We evaluate the effectiveness of our optimal cache placement algorithm by comparing it to the total cost of traffic when 0 to 50 caches are placed using a greedy algorithm and a random placement. We find that optimal placement of a small number of caches does measurably better than the other techniques, but that greedy placement performs surprisingly close to optimal when more caches are deployed.

The remainder of this paper is organized as follows: Section 2 discusses related work; Section 3 describes our process for constructing client clusters using BGP routing data; and Section 4 describes the results of evaluating client demand from a Web log using our clustering results. In Section 5 we present our algorithms for optimally placing caches based on client demand distribution, and in Section 6 we demonstrate the effectiveness of our cache placement methods. In Section 7, we summarize our results and conclude with directions for future study.

Next: Related Work Up: Cache Placement Methods Based Previous: Cache Placement Methods Based

Jim Gast 2001-04-20