Slide 1:

Today I will talk about the Summary Cache protocol, which we designed to support scalable cache sharing in a wide-area environment. This work was done jointly with my students, Li Fan and Jussara Almeida, and Dr. Andrei Broder at the Compaq Systems Research Center.

Slide 2:

First, a bit of background. Web caching is becoming increasingly popular these days. It promises to reduce Internet traffic, reduce Web server load, and reduce client latency. There are typically two kinds of caches: browser caches and proxy caches. Proxy caches sit at gateways and serve a large user community. Because they serve a large user group, proxy caches typically achieve good hit ratios and are particularly effective. Today, they are becoming more and more common.

Slide 3:

As proxy caches become common, the benefits of sharing cache contents among them become apparent. For example, in European countries, different universities and ISPs have their own proxy caches, yet they share a common bottleneck: the links to North America. If each proxy makes its cache content available to the other proxies and they serve each other's cache misses when they can, then the total traffic to North America is reduced, benefiting all parties. Because of this, cache sharing among proxies is widespread in Europe and Asia.

Slide 4:

The protocol they use is ICP. The way it works is the following: when a proxy has a cache miss, it sends queries to all its neighbors, as well as to its parent cache if there is one, asking "do you have this URL?" If any of them responds "yes", the proxy fetches the document from that cache. If nobody responds "yes" within a certain time limit, the proxy sends the request to the Web server.
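To make the message flow concrete, here is a minimal Python sketch of this lookup (just an illustration, not the real Squid/ICP wire protocol; each "proxy" is an in-memory dict and the origin callback stands in for the Web server):

def icp_lookup(url, neighbors, parent=None, origin=None):
    # On a local miss, ask every neighbor (and the parent, if any) for the URL.
    # Fetch from the first peer that says yes; otherwise go to the origin server.
    peers = list(neighbors) + ([parent] if parent is not None else [])
    for peer in peers:                    # one query message per peer, on every miss
        if url in peer:                   # peer answers "yes, I have it"
            return peer[url]
    return origin(url)                    # nobody answered yes: contact the Web server

# Example: two neighbor caches and a stand-in origin server.
a = {"http://example.com/x": "<html>x</html>"}
b = {}
print(icp_lookup("http://example.com/x", [a, b], origin=lambda u: "fetched " + u))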

As you can see, ICP is a very simple protocol, and it allows the caches to be very loosely coupled. It has been instrumental in encouraging the practice of cache sharing, and it has been very successful.

Slide 5:

However, ICP is not without its defects. The main problem is that ICP generates many query messages, and processing these query messages takes a lot of CPU time. Every time one proxy has a cache miss, everyone else has to process a query message.

To quantify the overhead, we did a simple experiment: we used four SPARC20 workstations connected by 100Base-T Ethernet to act as four proxies, running the Squid proxy software. We fed both synthetic benchmarks and traces to the proxies, and compared the performance of the proxies with no cache sharing against the proxies sharing caches with each other using ICP. We found that even with just 4 proxies, ICP's overhead is substantial: the query messages increase the network packets each proxy has to process by 13-30% and increase the CPU overhead by 8-25%. As a result, client latency increases by 2-12%, even with the benefit of remote cache hits. Due to time constraints, I will not elaborate further; the details of the experiments can be found in the paper.

Clearly, ICP is not scalable. What are the alternatives?

Slide 6:

Well, one alternative is to force all users to go through the same cache or the same array of caches. Though this works sometimes, it is not always feasible. For example, the proxies may go through different ISPs to connect to North America, and there is no central location where such a cache could be placed. Another alternative is to have a central directory server that knows which URLs are cached at every proxy, so that each proxy only needs to ask the directory server. However, the directory server can easily become a bottleneck.

Ideally, we want a protocol that is as loosely coupled as ICP, keeps the inter-proxy cache hit ratio high (because that is the whole purpose behind cache sharing), minimizes the inter-proxy messages, and scales to a large number of proxies.

Slide 7:

We propose Summary Cache. The idea is that each proxy keeps a directory of which URLs are cached at the other proxies, and when there is a cache miss, it probes this directory first before sending out queries. In other words, the directory acts as a filter to reduce query messages.

There are two problems with this idea. First, how do we keep the directories up to date? If a proxy has to send out an update message every time it puts a new document in its cache, then we haven't saved any messages compared to ICP. The solution is to delay and batch the updates; that is, the directory may get slightly out of date, but hopefully that won't hurt much in terms of performance. The second issue is the storage requirement. For performance reasons one wants to keep the directories in DRAM. However, a proxy often holds over 1 million Web pages, and each URL is on average 40 bytes long. That means 40MB for each neighboring proxy, and if you have 16 neighbors, that's 640MB of DRAM. The solution to this is to compress the directories. That is, the directory does not have to be precise, but it should be inclusive, meaning that it indicates a larger set of URLs than what is actually cached.
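Here is a rough sketch of the lookup path with summaries, again illustrative Python only: each proxy keeps an approximate, possibly out-of-date summary of every neighbor's cache and queries only the neighbors whose summary says "maybe". The summaries here are plain sets; in the real protocol they would be the compressed, inclusive directories just described.

def summary_cache_lookup(url, summaries, neighbors, origin):
    # summaries[i] approximates the set of URLs cached at neighbors[i]
    candidates = [i for i, s in enumerate(summaries) if url in s]
    for i in candidates:                  # query only the likely holders
        if url in neighbors[i]:           # real query; this can be a false hit
            return neighbors[i][url]
    return origin(url)                    # false hit or true miss: go to the server

# Example with one neighbor whose summary is slightly out of date.
neighbor = {"http://example.com/x": "<html>x</html>"}
summary = {"http://example.com/x", "http://example.com/old"}   # contains a stale entry
print(summary_cache_lookup("http://example.com/x", [summary], [neighbor],
                           lambda u: "fetched " + u))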

For Summary Cache to be a success, we need two things:
1. the delay in updating the directories must not cause the total cache hit ratio to drop excessively;
2. we must find a way to compress the directory so that it takes little memory and doesn't yield too many query messages.

But before we look into these, let's analyze the mistakes introduced by the approximations.

Slide 8:

There are three kinds of mistakes. The first is false misses: there is a cache miss at proxy A, the URL is cached at B, but A does not know it and misses the opportunity of a remote cache hit. This is mainly introduced by the delay in updating directories, and its effect is a lower total cache hit ratio. The second is false hits: the URL is not cached at B, but A thinks it is and sends a query message, only to be told that it is not there. This is mainly caused by the compression of the directory, and its effect is wasted inter-proxy messages. The third is stale hits: the URL is cached at B, A fetches it, but B's copy is stale. This is caused by the lack of an invalidation mechanism in today's Web caches, and its effect is also wasted inter-proxy messages.

As you can see, the delay in directory updates affects the total cache hit ratio, and the compression of directories affects the inter-proxy traffic. Now let's look at these two in turn.

Slide 9:

We use trace-driven simulation to examine the effect of update delay. Here I show the results from two traces, one from the proxy at DEC and another from a regional ISP in Australia called QuestNet. On the x-axis I am showing the threshold for update delay. A proxy updates its directories at the other proxies when the percentage of Web pages that are in its cache but not yet reflected in those directories reaches this threshold; the false miss ratio should be roughly proportional to the threshold. On the y-axis I am showing the total hit ratio under ICP and under Summary Cache. As you can see, even with a 5% or 10% threshold for update delay, the effect on total hit ratio is quite small, only about 5% (CHECK!). In our traces, the threshold translates to updating directories every 10 minutes to an hour. So the results clearly show that Summary Cache is promising.
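As a sketch of this threshold rule (the names here are illustrative, not taken from the prototype): a proxy sends a new summary only once the fraction of its cache that is not reflected in the last advertised directory crosses the threshold.

def should_send_update(num_cached_docs, num_unadvertised_docs, threshold=0.05):
    # True when the unreflected fraction of the cache reaches the threshold
    # (threshold=0.05 corresponds to the 5% point on the x-axis).
    if num_cached_docs == 0:
        return False
    return num_unadvertised_docs / num_cached_docs >= threshold

# With a 1,000,000-document cache and a 5% threshold, an update goes out only
# after roughly 50,000 documents have been added since the last one.
print(should_send_update(1_000_000, 49_000))   # False
print(should_send_update(1_000_000, 50_000))   # True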

Slide 10:

Now, how do we compress the directory? We already know that representing the directory as a list of document URLs takes too much memory. Our first try was to simply use the server portion of the URLs. Since one Web server typically has many documents, we got a factor of 10 reduction in memory requirement in our traces. However, this compression is too imprecise: it generates too many false hits, and as a result there are too many inter-proxy messages.
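For concreteness, the server-name summary is essentially this (a toy Python sketch with made-up URLs):

from urllib.parse import urlparse

def server_summary(cached_urls):
    # Keep only the host portion of each cached URL.
    return {urlparse(u).netloc for u in cached_urls}

summary = server_summary(["http://www.cs.wisc.edu/~cao/papers/summary.html",
                          "http://www.cs.wisc.edu/index.html"])
probe = urlparse("http://www.cs.wisc.edu/other.html").netloc
print(probe in summary)   # True even though /other.html is not cached: a false hit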

Slide 11:

Then, at that time, I was visiting DEC SRC, and I talked to my friend Monika Henzinger. I said: "Monika, here is my problem. Suppose I have a bag of integers. How can I come up with a compact representation of this bag of integers, so that I can ship this representation somewhere else, and at that somewhere else, I can just probe this representation and know whether an arbitrary integer is in this bag or not?" (Well, a bag of URLs and a bag of integers are the same thing.)

And she said: "Talk to Andrei, he knows about this." And Andrei introduced me to a really neat technique called Bloom filters.

Slide 12:

The way a Bloom filter works is the following. You construct a large bit array; initially all bits are 0. Then you pick a number of independent hash functions, in this case four. For every URL in the bag, you apply the four hash functions to get four integers, use these integers as indices into the bit array, and turn all four of those bits to 1. You do this for every URL in the bag.

Then you ship this bit array and the specification of the hash functions to the remote site. At the remote site, to see whether a URL is in the bag, you apply the four hash functions to the URL to get four indices and check whether all four bits are 1. If any one of the four bits is 0, then the URL is definitely not in the bag. If all of them are 1, then there is a high probability that the URL is in the bag.
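Here is a self-contained Python sketch of such a filter, with four hash functions derived from slices of the URL's MD5 digest (the prototype's actual filter construction is described in the paper; this sketch just follows the same kind of idea):

import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)   # bit array, initially all 0

    def _indices(self, url):
        # Split the 128-bit MD5 of the URL into k slices, reduce each mod m.
        digest = hashlib.md5(url.encode()).digest()
        step = len(digest) // self.k
        for i in range(self.k):
            chunk = digest[i * step:(i + 1) * step]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, url):
        for idx in self._indices(url):
            self.bits[idx // 8] |= 1 << (idx % 8)   # turn the bit to 1

    def __contains__(self, url):
        # All k bits set: probably in the bag. Any bit 0: definitely not.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(url))

bf = BloomFilter(m_bits=32)          # roughly 10 bits per URL for 3 entries
for u in ["http://a.com/1", "http://b.com/2", "http://c.com/3"]:
    bf.add(u)
print("http://a.com/1" in bf)        # True
print("http://z.com/9" in bf)        # False (or, rarely, a false positive)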

The nice thing about Bloom filters is the tradeoff between the size of the bit array and the false positive ratio: as you increase the size of the bit array, the false positive ratio drops exponentially.

Slide 13:

To see why that is the case, let's take a look at the math. Suppose we have n URLs, and we pick the size of the bit array, m, which should be bigger than two times n. Now, what is the right number of hash functions, k?

On the one hand, the larger k is, the more bits the remote site has to check for each probe, and the harder it is for all of them to be 1 by accident, which reduces the false positive ratio. On the other hand, the larger k is, the more bits are turned to 1 in the bit array, and the more chances there are for false positives at the remote site.

As you can imagine, it turns out that k is optimal when exactly half of the bits in the bit array are 1. After some calculation, this means the optimal k is ln(2)*m/n, or an integer close to it.

And since exactly half of the bits are 0, the false positive ratio at the remote site is 1/2 to the power of k. Plug in the optimal k, and you get that the false positive ratio is about 0.62 to the power of m/n.

Typically, for m/n = 6, that is, using 6 bits per URL, the false positive ratio is about 6%, and for m/n = 8, the false positive ratio is about 2%.
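For reference, the calculation behind these numbers is the standard Bloom filter analysis:

\[
  \Pr[\text{a given bit is still } 0] = \Bigl(1 - \tfrac{1}{m}\Bigr)^{kn} \approx e^{-kn/m},
  \qquad
  p = \bigl(1 - e^{-kn/m}\bigr)^{k}.
\]
\[
  k_{\mathrm{opt}} = \ln 2 \cdot \frac{m}{n}
  \quad\Longrightarrow\quad
  p = \Bigl(\tfrac{1}{2}\Bigr)^{k_{\mathrm{opt}}} \approx 0.6185^{\,m/n},
\]
so \(m/n = 6\) gives \(p \approx 5.6\%\) and \(m/n = 8\) gives \(p \approx 2.1\%\).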


Note: I got lazy here and didn't write down the rest of the notes. But you should have gotten the main ideas by now, and you can get the rest from the paper.

Slide 14:

< explain hash functions built by MD5;>
< explain local counter array;>
< explain updates;>
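A hedged sketch of these points (illustrative Python, not the code from the prototype): hash indices come from slices of the URL's MD5, and the proxy keeps a counter per bit locally so that evicted documents can be removed from its own summary before the next update goes out; only the derived bit array is shipped to the neighbors.

import hashlib

K = 4   # number of hash functions

def indices(url, m):
    # Same MD5-slicing idea as on slide 12.
    digest = hashlib.md5(url.encode()).digest()
    step = len(digest) // K
    return [int.from_bytes(digest[i * step:(i + 1) * step], "big") % m
            for i in range(K)]

class CountingSummary:
    def __init__(self, m):
        self.m = m
        self.counters = [0] * m           # small counters (4 bits each in the paper)

    def add(self, url):                   # document enters the cache
        for i in indices(url, self.m):
            self.counters[i] += 1

    def remove(self, url):                # document is evicted from the cache
        for i in indices(url, self.m):
            self.counters[i] -= 1

    def to_bits(self):
        # The bit vector advertised to neighbors: bit i is 1 iff counter i > 0.
        return [1 if c > 0 else 0 for c in self.counters]

s = CountingSummary(64)
s.add("http://a.com/1"); s.add("http://b.com/2"); s.remove("http://a.com/1")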

Slide 15:

explain why BF-8 and BF-16 are so close;

Slide 16:

< see, we still get pretty good hit ratios with bloom filter based summary. >

Slide 17:

We implemented a prototype in Squid 1.1.14. The details are described in the paper.

Experimental results; MD5 calculation isn't expensive;

Slide 18:

Conclusion.