Slide 1:
Today I will talk about the Summary Cache protocol, which we designed
to support scalable cache sharing in a wide-area environment. This is
joint work with my students, Li Fan and Jussara Almeida, and Dr.
Andrei Broder at the Compaq Systems Research Center.
Slide 2:
First, a bit of background. Web caching is becoming increasingly
popular these days. It promises to reduce Internet traffic, reduce
Web server load, and reduce client latency. There are typically two
kinds of caches: browser caches and proxy caches. A proxy cache sits at
a gateway and serves a large user community. Because they serve
large user groups, proxy caches typically achieve good hit ratios
and are particularly effective. Today, they are becoming more and more
common.
Slide 3:
As proxy caches become common, the benefits of sharing cache
contents among proxy caches become apparent. For example, in
European countries, different universities and ISPs have their own
proxy caches, yet they share a common bottleneck: the links to North
America. If each proxy makes its cache content available to other
proxies and serves the others' cache misses when it can, then
the total traffic to North America will be reduced, benefiting all
parties. Because of this, cache sharing among proxies is widespread in
Europe and Asia.
Slide 4:
The protocol they use is ICP. The way it works is the following:
when a proxy has a cache miss, it sends queries to all its
neighbors, and to the parent cache if there is one, asking "do you have
this URL?". If any of them responds "yes", then the proxy fetches
the document from it. If nobody responds "yes" within a certain
time limit, the proxy sends the request to the Web server.
As you can see, ICP is a very simple protocol, and allows the caches
to be very loosely coupled. It has been instrumental in encouraging the
practice of cache sharing, and it has been very successful.
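To make the query flow concrete, here is a minimal sketch in Python of the ICP miss path just described. The proxy names are hypothetical, and modeling the neighbor caches as in-memory sets (with the timeout implicit) is an illustrative simplification, not part of ICP itself.

```python
def icp_lookup(url, neighbor_caches):
    """On a local cache miss, query every neighbor ("do you have this
    URL?"); fetch from the first one that answers yes, otherwise fall
    back to the origin Web server (as if the timeout expired)."""
    for name, cache in neighbor_caches.items():
        # each iteration stands for one ICP query/response round trip
        if url in cache:
            return ("neighbor", name)   # remote cache hit
    return ("origin", None)             # nobody answered yes in time

# Hypothetical neighbor caches, modeled as sets of cached URLs.
neighbors = {
    "proxy-B": {"http://a.example/x"},
    "proxy-C": {"http://b.example/y"},
}
print(icp_lookup("http://b.example/y", neighbors))  # ('neighbor', 'proxy-C')
print(icp_lookup("http://c.example/z", neighbors))  # ('origin', None)
```

Note that every miss at one proxy triggers a query at every other proxy, which is exactly the overhead quantified on the next slide.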
Slide 5:
However, ICP is not without its defects.
The main problem is that ICP leads to many query messages, and
processing these query messages takes a lot of CPU time.
Every time one proxy has a cache miss, everyone else has to process
a query message.
To quantify the overhead, we did a simple experiment: we used four
SPARC20 workstations connected with a 100Base-T Ethernet to act as
four proxies, running the Squid proxy software.
We fed both synthetic benchmarks and traces to the proxies, and
compared the performance of the proxies with no cache sharing
against the proxies sharing caches with each other using ICP.
We found that even with just 4 proxies, ICP's overhead is substantial:
the query messages increase the network packets each proxy has to
process by 13-30%, and increase the CPU overhead by 8-25%. As a
result, client latency increases by 2-12%, even when there are
remote cache hits.
Due to time constraints, I will not elaborate further. The details of the
experiments can be found in the paper.
Clearly, ICP is not scalable. What are the alternatives?
Slide 6:
Well, one alternative is to force all users to go
through the same cache or the same array of caches. Though this works
sometimes, it is not always feasible. For example, the proxies may
go through different ISPs to connect to North America, and you can't
find a central location to put this cache. Another alternative is
to have a central directory server that knows what URLs are cached at
every proxy, and each proxy just needs to ask the directory server.
However, the directory server can easily become a bottleneck.
Ideally, we want a protocol that is as loosely coupled as
ICP, keeps the inter-proxy cache hit ratio
high (because that's the whole purpose of cache sharing),
minimizes inter-proxy messages, and scales to
a large number of proxies.
Slide 7:
We propose Summary Cache. The idea is that each proxy keeps a
directory of what URLs are cached at the other proxies, and on
a cache miss, it probes this directory first before sending out queries.
In other words, the directory acts as a filter to reduce query messages.
There are two problems with this idea. First, how do we keep the
directories up to date? If a proxy has to send out update messages
every time it puts a new document in its cache, then we save no
messages compared to ICP. The solution is to delay and batch the updates;
that is, the directory may get slightly out of date, but hopefully
this will not hurt much in terms of performance. The second issue is the
storage requirement. For performance reasons one wants to keep
the directories in DRAM. However, a proxy often holds over 1
million Web pages, and each URL on average is 40 bytes long. That
means 40MB for each neighboring proxy, and if you have 16 neighbors,
that's 640MB of DRAM. The solution to this is to compress the
directories. That is, the directory does not have to be precise,
but should be inclusive, meaning that it indicates a larger set of
URLs than what are cached.
For summary cache to be a success, we need two things:
1. the delay in updating the directories must not cause the total
cache hit ratio to drop excessively;
2. we must find a way to compress the directory so that it takes little
memory and doesn't yield too many query messages.
But before we look into these, let's analyze the mistakes
introduced by the approximations.
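The miss path under Summary Cache can be sketched as follows. The directories are modeled as plain Python sets here, standing in for the compressed, possibly stale summaries, and the proxy names and URLs are illustrative.

```python
def handle_miss(url, local_directories):
    """On a local cache miss, probe the locally kept directories of the
    neighbors' contents first; send queries only to the proxies whose
    directory claims the URL, and go to the origin server if none does.
    Because directories are stale and compressed, a listed proxy may not
    actually have the document (a false hit), and an unlisted one may
    actually have it (a false miss)."""
    candidates = [proxy for proxy, d in local_directories.items() if url in d]
    return candidates if candidates else ["origin"]

directories = {
    "proxy-B": {"http://a.example/x"},
    "proxy-C": {"http://a.example/x", "http://b.example/y"},
}
print(handle_miss("http://a.example/x", directories))  # ['proxy-B', 'proxy-C']
print(handle_miss("http://c.example/z", directories))  # ['origin']
```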
Slide 8:
There are three kinds of mistakes. The first one is false misses.
This is the case when there is a cache miss at proxy A, the URL is
cached at B, but A does not know, and misses the opportunity of a
remote cache hit. This is mainly introduced by the delay in updating
directories, and its effect is lower total cache hit ratio.
The second one is false hits. This is the case when the URL is not
cached at B, but A thinks it is, and sends a query message, only to
be told that it is not there. This is mainly caused by the compression
of the directory, and its effect is wasted inter-proxy messages.
The third one is stale hits. This is the case when the URL is cached
at B, A fetches it, but B's copy is stale. This is caused by the
lack of an invalidation mechanism in today's Web caches, and its effect is
wasted inter-proxy messages.
As you can see, the delay in directory update affects the total cache
hit ratio, and the compression of directories affects the inter-proxy
traffic. Now let's look at the two.
Slide 9:
We use trace-driven simulation to examine the effect of update delay.
Here I show the results from two traces, one from the proxy at
DEC, and another from a regional ISP in Australia called QuestNet.
On the x-axis I am showing you the threshold for update delay. The
way we determine when it is time to update the directories is this:
when the percentage of the Web pages that are in a proxy's cache but
not yet reflected in its directory at the other proxies reaches a certain
threshold, an update is sent. Hopefully, the false miss ratio is
proportional to the threshold. On the y-axis I am showing you the
total hit ratio under
ICP, and under summary cache. As you can see, even with a 5% or 10%
threshold for update delay, the effect on the total hit ratio is quite
small, only about 5% (CHECK!). The threshold translates to updating
directories every 10 minutes to an hour in our traces. So the results
clearly show that summary cache is promising.
Slide 10:
Now, how do we compress the directory? We already know that
representing the directory as a list of document URLs takes too much
memory. Our first try is to simply use the server portion of the URLs.
Since one Web server typically hosts many documents, we got a factor of
10 reduction in memory requirements on our traces. However, this
compression is too imprecise: it generates too many false hits, and
as a result there are too many inter-proxy messages.
Slide 11:
Then at that time, I was visiting DEC SRC, and I talked to my
friend Monika Henzinger. I said: "Monika, here is my problem. Suppose
I have a bag of integers. How can I come up with a compact representation
of this bag of integers, so that I can ship this representation somewhere
else, and at that somewhere else, I can just probe this representation and
know whether an arbitrary integer is in this bag or not?" (Well, a bag
of URLs and a bag of integers are the same thing.)
And she said: "Talk to Andrei, he knows about this." And Andrei introduced
me to a really neat technique called Bloom filters.
Slide 12:
The way a Bloom filter works is the following. You construct a large
bit array. Initially all bits are 0. Then you pick a number of independent
hash functions; in this case 4. Then for every URL in the bag, you apply the
four hash functions, getting four integers, then you use the four integers
as indices into the bit array, and turn all four bits to 1. You do this
for every URL in the bag.
Then, you ship this bit array and the specification of the hash functions to
the remote site. Then at the remote site, to see whether a URL is in this
bag, you apply the four hash functions to the URL to get four indices, and
then you check whether all four bits are 1. If any of the four
bits is 0, then the URL is definitely not in the bag. If all of them are 1,
then there is a high probability that the URL is in the bag.
The nice thing about Bloom filters is the tradeoff
between the size of the bit array and the false positive ratio. As you
increase the size of the bit array, the false positive ratio drops
exponentially.
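A minimal Bloom filter along these lines fits in a few lines of Python. Deriving the four hash functions by salting MD5 is just one convenient choice here, not necessarily the construction used in the paper.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m          # the large bit array, initially all 0

    def _indices(self, url):
        # derive k independent-looking indices by hashing the URL with
        # k different salts (an illustrative scheme)
        return [int(hashlib.md5(f"{i}:{url}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]

    def add(self, url):
        for i in self._indices(url):
            self.bits[i] = 1         # turn all k bits to 1

    def __contains__(self, url):
        # any 0 bit => definitely not in the bag;
        # all bits 1 => probably in the bag (false positives possible)
        return all(self.bits[i] for i in self._indices(url))

bf = BloomFilter(m=8192)
bf.add("http://a.example/x")
bf.add("http://b.example/y")
print("http://a.example/x" in bf)   # True
```

To ship the summary, a proxy would send `bf.bits` plus the specification of the hash functions; the neighbor rebuilds the same structure and probes it locally.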
Slide 13:
To see why that is the case, let's take a look at the math. Suppose we have
n URLs, and we pick the size of the bit array, m, which should be bigger than
two times n. Now, what is the right number of hash functions, k?
On the one hand, the larger k is, the more tests the remote site has to
do, but the lower the false positive ratio. On the other hand, the larger
k is, the more bits are turned to 1 in the bit array, and the greater
the chance of false collisions at the remote site.
As you can imagine, it turns out that k is optimal when exactly half of the
bits in the bit array are 1. After some calculation, this means that
the optimal k is ln(2)*m/n, or an integer close to it.
And since exactly half of the bits are 0, the false positive ratio at the
remote site is 1/2 to the power of k. Plug the formula in, and you
get that with the optimal k, the false positive ratio is 0.62 to the power
of m/n.
Typically, for m/n = 6, that is, using 6 bits per URL, the false positive
ratio is about 6%, and for m/n = 8, the false positive ratio is about 2%.
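These numbers can be checked directly; this snippet just evaluates the formulas above for the two bits-per-URL settings mentioned.

```python
import math

def optimal_k(m_over_n):
    # optimal number of hash functions: k = ln(2) * m/n, rounded
    return round(math.log(2) * m_over_n)

def false_positive(m_over_n):
    # with the (real-valued) optimal k, p = (1/2)**k = 0.6185**(m/n)
    return 0.6185 ** m_over_n

print(optimal_k(6), false_positive(6))   # 4, about 0.056 (~6%)
print(optimal_k(8), false_positive(8))   # 6, about 0.021 (~2%)
```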
Note: I got lazy here and didn't write down the rest of the notes. But
you should have gotten the main ideas by now, and you can get the rest
from the paper.
Slide 14:
< explain hash functions built by MD5;>
< explain local counter array;>
< explain updates;>
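Since the notes for this slide are only placeholders, here is a hedged sketch of the two mechanisms they name: deriving several hash functions from a single MD5 digest, and keeping a local counter array so that cache evictions can be reflected in the summary. The filter size and the digest-splitting scheme are illustrative assumptions, not necessarily the paper's exact choices.

```python
import hashlib

M = 1 << 16                     # filter size (a power of two, for easy reduction)

def md5_indices(url):
    """Split the 128-bit MD5 digest of the URL into four 32-bit words
    and reduce each modulo M: four hash functions from one MD5
    computation."""
    digest = hashlib.md5(url.encode()).digest()
    return [int.from_bytes(digest[i:i + 4], "big") % M for i in (0, 4, 8, 12)]

# Counting variant: the proxy keeps small counters locally, so that an
# eviction can undo an insertion; only the derived bit vector
# (counter > 0) would ever be shipped to the neighbors in a batched update.
counters = [0] * M

def on_cache_insert(url):
    for i in md5_indices(url):
        counters[i] += 1

def on_cache_evict(url):
    for i in md5_indices(url):
        counters[i] -= 1        # a plain bit array could not support this

on_cache_insert("http://a.example/x")
assert all(counters[i] > 0 for i in md5_indices("http://a.example/x"))
on_cache_evict("http://a.example/x")
assert not any(counters)        # summary is empty again after the eviction
```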
Slide 15:
explain why BF-8 and BF-16 are so close;
Slide 16:
< see, we still get pretty good hit ratios with bloom filter based summary. >
Slide 17:
We implemented a prototype in Squid 1.1.14. The details are described in
the paper.
Experimental results; MD5 calculation isn't expensive;
Slide 18:
Conclusion.