We first investigate two naive summary representations: exact-directory and server-name. In the exact-directory approach, the summary is essentially the cache directory, with each URL represented by its 16-byte MD5 signature [38,22]. In the server-name approach, the summary is the list of the server name component of the URLs in cache. Since on average, the ratio of different URLs to different server names is about 10 to 1 (observed from our traces), the server-name approach can cut down the memory by a factor of 10.
We simulate these approaches using the traces and found that neither of them is satisfactory. The results are in Figure 7, along with those on another summary representation (Figure 7 is discussed in detail in Section 5.2). The exact-directory approach consumes too much memory. In practice, proxies typically have 8GB to 20GB of cache space. If we assume 16 proxies of 8GB each and an average file size of 8KB, the exact-directory summary would consume (16 -1) * 16 * (8GB/8KB) = 240MB of main memory per proxy. The server-name approach, though consuming less memory, generates too many false hits that significantly increase the network messages.
The requirements on an ideal summary representation are small size and low false hit ratio. After a few other tries, we found a solution in an old technique called ``Bloom filters.''