Next: The Model
Up: On the Implications of
Previous: Introduction
A number of existing publications point out the applicability of
Zipf's law to Web accesses. Early research includes [Gla94] and
[ABCdO96].
[Gla94] studied Web accesses of about 60 users for a period of weeks,
and found that the accesses follow the 1/i distribution closely, where i is
the popularity rank of a document.
[ABCdO96] studied Web accesses of about 200 users in a department, collecting
accesses that were not filtered by the browser cache. They found that the
accesses follow 1/i^0.97, which is close to 1/i.
More recently, [Kim98] reports that Web accesses in a large proxy trace also
follow a Zipf-like distribution, but with 1/i^0.67. That is, the alpha value
(the exponent in 1/i^alpha) tends to be 0.67 instead of 1.
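As an illustration of what "the alpha value" means in practice, the following
is a minimal sketch of one common way to estimate it: rank documents by access
count and fit a straight line to log(count) versus log(rank) by least squares.
This is not necessarily the fitting procedure used in the cited studies; the
function names and the synthetic trace generator are hypothetical.

```python
import math
from collections import Counter


def estimate_zipf_alpha(requests):
    """Estimate alpha, assuming the i-th most popular document is
    requested with frequency proportional to 1/i^alpha.

    `requests` is any iterable of document identifiers (one per access).
    The estimate is the negated least-squares slope of log(count)
    against log(rank). This is an illustrative method, not the paper's.
    """
    counts = sorted(Counter(requests).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct documents")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def synthetic_trace(alpha, n_docs=200, top_count=1000):
    """Build a hypothetical trace in which document i is requested
    about top_count / i^alpha times (rounded, at least once)."""
    trace = []
    for i in range(1, n_docs + 1):
        trace.extend([f"doc{i}"] * max(1, round(top_count / i ** alpha)))
    return trace


if __name__ == "__main__":
    print(estimate_zipf_alpha(synthetic_trace(0.67)))
```

On real traces the tail of rarely accessed documents can distort such a fit,
so the range of ranks included in the regression is a choice to be made.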
Since we also have access to five large proxy traces, we studied whether they
follow Zipf's distribution as well. Our conclusions are as follows:
- Web accesses seen by the proxies tend to follow Zipf's distribution
very well, but with alpha ranging from 0.57 to 0.67, instead of 1.
Environments with a small number of users or homogeneous users (for example,
employees of a company) tend to have higher alpha. Environments with a
large number of users tend to have smaller alpha. These findings are in
agreement with [Kim98].
- Web requests seen by parent caches, that is, those that are filtered
by child proxies, also follow Zipf's distribution very well, though the
alpha tends to be smaller (around 0.57). This is a somewhat surprising
result, though explainable.
- There is no correlation between the popularity of a document and its
size; on average, hot documents do not tend to be smaller or larger than
cold documents.
- Some of the hot documents change very often. Hot documents are not
more stable than cold documents. In terms of the ratio of changes per
access, the range of the ratio for hot documents is as wide as it is for
cold documents. The same conclusion holds even for one of our traces (NLANR),
which has no last-modified date information, so that we can only infer
changes from changes in a document's size.
- Hot documents are roughly evenly distributed over the hot Web servers.
The number of hot Web servers y as a function of the number of hot Web
documents x (the first x documents) is y = x^(3/4).
That is, no single Web server contributes most of the hot documents.
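The relation between hot documents and hot servers can be checked on a trace
with a sketch like the one below: it counts how many distinct servers host the
x most requested URLs and sets that against x^(3/4). It assumes each trace
entry is a full URL whose host part identifies the server; the function names
are hypothetical and this is not the authors' analysis code.

```python
from collections import Counter
from urllib.parse import urlsplit


def hot_server_counts(urls, xs):
    """For each x in xs, count the distinct servers that host the
    x most frequently requested URLs in `urls` (one entry per access)."""
    ranked = [url for url, _ in Counter(urls).most_common()]
    return {x: len({urlsplit(u).netloc for u in ranked[:x]}) for x in xs}


def predicted_hot_servers(x):
    """Number of hot servers predicted by the y = x^(3/4) relation."""
    return x ** 0.75


if __name__ == "__main__":
    # Tiny hypothetical trace: three accesses to one URL, two to another, ...
    trace = (
        ["http://a.example/1"] * 3
        + ["http://b.example/1"] * 2
        + ["http://a.example/2"]
    )
    for x, observed in hot_server_counts(trace, [1, 2, 3]).items():
        print(x, observed, predicted_hot_servers(x))
```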
These observations have implications for Web caching algorithms and Web
cache consistency maintenance. The fact that hot documents are not necessarily
small means that an algorithm that always replaces the largest document may
not work well. The fact that hot documents can change often means that any
cache consistency scheme should not assume that the more popular a document
gets, the more stable it is.
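The changes-per-access ratio behind the stability observation can be sketched
as follows. The record format is an assumption: each access is a pair of a URL
and a "signature", which would be the last-modified date where the trace has
one, or the document size otherwise. A change is counted whenever the signature
differs from the one seen on the previous access to the same URL.

```python
from collections import defaultdict


def changes_per_access(records):
    """Return {url: changes / accesses} for a trace.

    `records` is an iterable of (url, signature) pairs in trace order.
    The signature is hypothetical: a last-modified timestamp, or the
    document size when no modification date is available.
    """
    accesses = defaultdict(int)
    changes = defaultdict(int)
    last_seen = {}
    for url, signature in records:
        accesses[url] += 1
        # A differing signature on a repeat access is inferred to be a change.
        if url in last_seen and last_seen[url] != signature:
            changes[url] += 1
        last_seen[url] = signature
    return {url: changes[url] / accesses[url] for url in accesses}


if __name__ == "__main__":
    trace = [("/a", 1), ("/a", 1), ("/a", 2), ("/a", 3), ("/b", 10), ("/b", 10)]
    print(changes_per_access(trace))
```

Inferring changes from size alone undercounts modifications that leave the
size unchanged, so ratios computed that way are lower bounds.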
Finally, the fact that hot documents are more or less evenly distributed
across hot Web servers implies that no single Web server dominates the Web;
rather, a collection of Web servers absorbs a large percentage of the Web
traffic.
Graphs supporting the claims.
Pei Cao
6/2/1998