next up previous
Next: Topologically-guided Clustering Up: Cache Placement Methods Based Previous: Introduction


Related Work

Our clustering algorithm is analogous to generating a spanning tree for the AS graph. Prim's algorithm [22] is a standard method for constructing a minimum spanning tree when the root of the tree is known in advance: starting at the root, it repeatedly adds the cheapest edge that connects the tree grown so far to a vertex not yet in the tree. Our graph does not have a pre-defined root, and in fact one of our goals is to discover the backbone of the Internet dynamically. Kruskal's algorithm [16] does not require a starting point. It constructs a spanning forest that initially contains a tiny tree for each vertex. Trees are then combined by coalescing them at the shortest edges first. Any edge that does not cross between trees is redundant, and any edge left over after all of the vertices have been visited is similarly unneeded. Although Kruskal's algorithm serves as the inspiration for our algorithm, we still had to address the stopping criterion, since declaring a single root for the entire Internet would have artificially added several hops in the core of the Internet, where a dozen of the largest transit providers are almost completely interconnected. Kruskal's algorithm finds a minimum spanning tree in the sense that the total of the edge lengths is minimized, even if that makes the tree deep. Our goal is subtly different, since we want a tree that has maximum fidelity to the traffic flow in the Internet. In particular, we want a shallow tree so that nodes are not mistakenly represented as far from the backbone. Even outside the core, Kruskal's algorithm produces trees that are inappropriately deep when presented with neighborhoods of completely interconnected vertices.
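To make the contrast concrete, the following is a minimal sketch of classical Kruskal's algorithm with a union-find structure; it is illustrative only (not the authors' code), and the node indices and edge weights are hypothetical. Edges are taken shortest-first, edges inside a single tree are discarded as redundant, and trees coalesce until a spanning forest remains.

```python
def kruskal(n, edges):
    """Return a minimum spanning forest of an n-node graph.

    edges: list of (weight, u, v) tuples with 0 <= u, v < n.
    """
    parent = list(range(n))  # union-find forest: one tiny tree per vertex

    def find(x):
        # Path-halving find: walk to the root, flattening as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):       # shortest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                    # edge crosses two trees: keep it
            parent[ru] = rv             # coalesce the trees
            forest.append((u, v))
        # an edge within one tree is redundant and skipped
    return forest

print(kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]))
# → [(0, 1), (1, 2), (2, 3)]; the weight-3 edge (0, 2) is redundant
```

Note that nothing in this classical formulation limits tree depth, which is exactly the property the modified stopping criterion described above addresses.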

Initial work on clustering clients and proxy placement was done by Cunha in [5]. That work described a process of using traceroute to generate a tree graph of client accesses (using IP addresses collected from a Web server's logs). Proxies were then placed in the tree using three different algorithms, and the resulting reductions in server load and network traffic were evaluated. Our work differs in its use of AS-level information from BGP routing tables to create a tree that is simpler and more efficient to build. Our cache placement algorithms differ in that the coarser aggregation allows us to use a method that guarantees optimal placement. The next significant work on client clustering was done by Krishnamurthy and Wang in [15]. In that work, the authors merge the longest-prefix entries (i.e., those with the most detail) from a set of 14 BGP routing tables. This creates a prefix/netmask table of approximately 390K possible clusters. IP addresses from Web server logs are then clustered by finding the longest prefix match in the prefix/netmask table. While this approach generates client clusters that are topologically close and of minimal size, it does not provide for further levels of aggregation of clusters.
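The longest-prefix-match step can be sketched as follows. This is an illustrative toy, not Krishnamurthy and Wang's implementation: the three-entry prefix table stands in for their ~390K-entry merged table, and a real system would use a radix trie rather than a linear scan.

```python
import ipaddress

# Hypothetical prefix/netmask table (stand-in for the merged BGP table).
PREFIXES = [
    ipaddress.ip_network("128.105.0.0/16"),
    ipaddress.ip_network("128.105.2.0/24"),
    ipaddress.ip_network("12.0.0.0/8"),
]

def cluster_of(ip_str):
    """Assign an IP to the cluster of its longest matching prefix (or None)."""
    ip = ipaddress.ip_address(ip_str)
    matches = [p for p in PREFIXES if ip in p]
    # Longest prefix = most specific match = largest prefix length.
    return max(matches, key=lambda p: p.prefixlen, default=None)

print(cluster_of("128.105.2.7"))   # the /24 wins over the /16
```

Because each client maps to exactly one most-specific prefix, the resulting clusters are small and topologically tight, but, as noted above, there is no natural way to merge them into coarser clusters.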

Content distribution companies (e.g., Akamai, Digital Island, epicRealm) and wide-area load balancing product vendors (e.g., Cisco, Foundry and Nortel) also use the notion of client clustering to redirect client requests to distributed caches. These companies use the Domain Name System (DNS) [19] as a means for both determining client location and redirecting requests. The assumption made in DNS redirection is that clients whose DNS requests come from the same DNS server are topologically close to each other.

Initial work in [14] evaluates the performance of redirection schemes that access documents from multiple proxies versus a single proxy and shows that retrieving the embedded objects of a single page from different servers is sub-optimal. Subsequent work in [27] indicates that clients and their nameservers are frequently neither topologically close nor close from the perspective of packet latency. However, Myers et al. show that the relative ranking of download times for the same three sites across 47 different mirrors was stable [20].

Caching was widely studied during the 1990s as a means of enhancing performance in the Internet. These studies include cache traffic evaluation [1,2], replacement algorithm performance [30,6], cache hierarchy architecture [9,18] and cache appliance design [3,4]. A number of recent papers have addressed the issue of proxy placement based on assumptions about the underlying topological structure of the Internet [17,13,24]. In [17], Li et al. describe an optimal dynamic programming algorithm for placing multiple proxies in a tree-based topology. Their algorithm is comparable to ours, although it is less efficient: it places $M$ proxies in a tree with $N$ nodes and operates in $O(N^3M^2)$ time. Jamin et al. [13] examine a number of proxy placement algorithms under the assumption that the underlying topological structure is not a tree. Their results show quickly diminishing benefits from placing additional mirrors (defined as proxies that service all client requests directed to them), even when sophisticated and computationally intensive techniques are used. In [24], Qiu et al. also evaluate the effectiveness of a number of graph-theoretic proxy placement techniques. They find that proxy placement that considers both distance and request load performs a factor of 2 to 5 better than random proxy placement. They also find that a greedy algorithm for mirror placement (one which simply iteratively chooses the best node as the site for the next mirror) performs better than a tree-based algorithm.
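The greedy mirror-placement idea evaluated by Qiu et al. can be sketched as below. This is a hedged illustration of the general technique, not their implementation; the distance matrix, request loads, and cost function (load-weighted distance to the nearest chosen mirror) are assumptions for the example.

```python
def greedy_place(dist, load, k):
    """Greedily choose k mirror sites.

    dist[c][x]: distance from client c to candidate node x.
    load[c]:    request rate of client c.
    Each round picks the node that most reduces total weighted distance.
    """
    n_nodes = len(dist[0])
    chosen = []

    def cost(sites):
        # Each client is served by its nearest chosen mirror.
        return sum(l * min(row[s] for s in sites)
                   for row, l in zip(dist, load))

    for _ in range(k):
        best = min((x for x in range(n_nodes) if x not in chosen),
                   key=lambda x: cost(chosen + [x]))
        chosen.append(best)
    return chosen

# Two clients at opposite ends of a path; two mirrors cover both.
print(greedy_place([[0, 5], [5, 0]], [1, 1], 2))
```

Each iteration re-evaluates every remaining candidate, so the sketch runs in $O(k \cdot N \cdot C)$ time for $C$ clients, which is far cheaper than the $O(N^3M^2)$ tree-based dynamic program cited above.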

Both router-level and inter-domain topology have been studied over the past five years [11,21,28,10,7]. Our clustering algorithm uses BGP data, so inter-domain topology is most relevant to this work. In [10], Govindan and Reddy characterize inter-domain topology and route stability using BGP routing table information collected over a one-year period. In that work the authors describe inter-domain topology in terms of diameter, degree distribution and connectivity characteristics. Inter-domain routing information can be collected from a number of public sites including NLANR [8], Merit [12] and Route Views [29] (our source of routing information). These sites provide BGP tables from looking-glass routers located in various places in the Internet that peer with a large number of ISPs.
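As an illustration of how such BGP data yields an inter-domain topology, the sketch below derives AS-level adjacencies from AS paths of the kind found in Route Views tables. The AS numbers shown are hypothetical examples, and real BGP dumps require parsing that is omitted here; the sketch only shows the adjacency-extraction step.

```python
from collections import defaultdict

def as_graph(as_paths):
    """Build an undirected AS adjacency map from a list of AS paths.

    Each path is a list of AS numbers; consecutive ASes are neighbors.
    """
    adj = defaultdict(set)
    for path in as_paths:
        for a, b in zip(path, path[1:]):
            if a != b:          # skip AS-path prepending repeats
                adj[a].add(b)
                adj[b].add(a)
    return adj

g = as_graph([[7018, 701, 2152],
              [7018, 701, 701, 59]])   # note the prepended 701
print(sorted(g[701]))   # neighbors of AS 701
```

Degree distribution and connectivity characteristics of the kind studied by Govindan and Reddy follow directly from such an adjacency map.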


Jim Gast 2001-04-20