next up previous
Next: Bloom Filters - the Up: Summary Cache Previous: Impact of Update Delays

Summary Representations

  The second issue affecting scalability is the size of the summary. Summaries need to be stored in the main memory not only because memory lookups are much faster, but also because disk arms are typically the bottlenecks in proxy caches [36]. Although DRAM prices continue to drop, we still need a careful design, since the memory requirement grows linearly with the number of proxies. Summaries also take DRAM away from the in-memory cache of hot documents, affecting the proxy performance. Thus, it is important to keep the summaries small. Fortunately, summaries only have to be inclusive (that is, depicting a superset of the documents stored in the cache) to avoid affecting the cache hit ratio.

We first investigate two naive summary representations: exact-directory and server-name. In the exact-directory approach, the summary is essentially the cache directory, with each URL represented by its 16-byte MD5 signature [38,22]. In the server-name approach, the summary is the list of the server name component of the URLs in cache. Since on average, the ratio of different URLs to different server names is about 10 to 1 (observed from our traces), the server-name approach can cut down the memory by a factor of 10.

We simulate these approaches using the traces and found that neither of them is satisfactory. The results are in Figure 7, along with those on another summary representation (Figure 7 is discussed in detail in Section 5.2). The exact-directory approach consumes too much memory. In practice, proxies typically have 8GB to 20GB of cache space. If we assume 16 proxies of 8GB each and an average file size of 8KB, the exact-directory summary would consume (16 -1) * 16 * (8GB/8KB) = 240MB of main memory per proxy. The server-name approach, though consuming less memory, generates too many false hits that significantly increase the network messages.

The requirements on an ideal summary representation are small size and low false hit ratio. After a few other tries, we found a solution in an old technique called ``Bloom filters.''


next up previous
Next: Bloom Filters - the Up: Summary Cache Previous: Impact of Update Delays
Pei Cao
7/5/1998