In this paper we have shown that a simple model based on Zipf's law can explain the asymptotic behavior for three properties that are observed in real Web cache traces. Therefore, we argue that Zipf's law should perhaps be considered as an appropriate model for Web reference streams at a particular level of abstraction. Our results also indicate that these properties found in many studies are perhaps inherent to Web access streams and not an artifact of the traces studied.
We also revisited the issue of Web cache replacement in light of the new model and found that the Least-Frequently-Used (LFU) algorithm as dictated by the new model indeed performs best in terms of byte hit-ratios. However, practical implementations of LFU, which typically keep reference counters for in-cache documents only, performs poorly and should not be used as a cache replacement algorithm.
In order to gain an understanding of how document reference probabilities change with time, we investigated the daily changes of the hot set in Web reference streams. Looking at one proxy traces, we found that about a third of the documents in the hot set (those that account for 10% of the total reference) changes from day to day. On the other hand, about 60% of the documents remain in the hot set for more than three days.
Although we have argued in this paper that Zipf's law enables us to construct a reasonably accurate model, some researchers believe that Zipf's law is not accurate enough. In particular, Almeida et al. compared plots of miss-ratio for a synthetic workload having a Zipf distribution against plots of miss-ratio for real workloads and concluded that the Zipf model was inaccurate because it did not capture the locality of reference in the real request stream [ABCdO96].
Clearly our model has limitations.
First, the model does not consider the cache's replacement policy,
which no doubt plays a critical role in a cache's performance.
Second, Zipf's law over-constrains the request probability
function as where
. One question that
should be investigated is: how would the results for hit-ratio and
interarrival times change if the exponent
were not
constrained?
Third, the model does not consider document modifications.
However, our mathematical model has the advantage that it can
predict the asymptotic behavior of three properties
where as regular simulations would require overly large trace files
to obtain such asymptotic results.
Our goal is to find a model that is accurate enough to assist Web cache design and configuration. We are currently trying to improve the model and trying to understand its applicability to other proxy traces as well as its implications for Web cache design. One interesting question is whether the Zipf's model applies to Web accesses seen by the parent proxy in a proxy hierarchy.