
Overview

The prediction algorithms in this study observe patterns in the past accesses of all clients to predict what each client might access next. The patterns we capture are of the form ``a user is likely to access URL B right after he or she accesses URL A.'' Clearly, only accesses from the same user should be correlated; accesses from different users are unrelated and may arrive at the proxy in any order.

The actual prediction algorithm is based on the Prediction-by-Partial-Match (PPM) data compressor [1,6]. An m-th order predictor uses the context of the past m references to predict the next set of URLs. The algorithm maintains a data structure that keeps track of the URLs that follow each URL, each sequence of two URLs, and so on up to each sequence of m URLs. To make a prediction, the immediate past reference, the past two references, the past three references, and so on up to the past m references are matched against this data structure to produce the set of URLs that are likely to be accessed next. These URLs are sorted first by giving preference to longer matching prefixes, and then by giving preference to URLs with higher probability within the same prefix.
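
As an illustration of this scheme (a sketch, not the code used in the study), the following Python fragment maintains successor counts for contexts of length 1 through m on a per-client basis; the class and method names are our own. Accesses recorded for a client build the tables, and a prediction is ordered by longest matching prefix first, then by probability within a prefix.

    from collections import defaultdict, deque

    class PPMPredictor:
        """Sketch of an m-th order PPM-style URL predictor (illustrative only)."""

        def __init__(self, order):
            self.order = order
            # counts[context][url] = how many times url followed context,
            # where context is a tuple of 1..order preceding URLs.
            self.counts = defaultdict(lambda: defaultdict(int))
            # Per-client history of the last `order` references, so that only
            # accesses from the same user are correlated.
            self.history = defaultdict(lambda: deque(maxlen=order))

        def record(self, client, url):
            """Update the pattern tables with one access made by `client`."""
            hist = tuple(self.history[client])
            for k in range(1, len(hist) + 1):
                self.counts[hist[-k:]][url] += 1   # contexts of length 1..m
            self.history[client].append(url)

        def predict(self, client):
            """Return candidate next URLs, longest matching prefix first,
            then higher probability within the same prefix."""
            hist = tuple(self.history[client])
            candidates, seen = [], set()
            for k in range(len(hist), 0, -1):      # prefer longer prefixes
                followers = self.counts.get(hist[-k:])
                if not followers:
                    continue
                total = sum(followers.values())
                for url, count in sorted(followers.items(),
                                         key=lambda item: item[1], reverse=True):
                    if url not in seen:
                        seen.add(url)
                        candidates.append((url, k, count / total))
            return candidates

Keeping a separate history per client reflects the constraint above that only accesses from the same user are correlated, even though one shared set of count tables is updated by all clients.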

There are two aspects of the algorithm that are particular to the Web proxy context. First, the algorithms use patterns observed in all users' references to predict any particular user's future references. On the one hand, this may reduce the accuracy of the prediction, because one user's patterns are used to predict a different user's accesses. On the other hand, it increases the number of times the algorithms can make predictions, because the sample size is larger. We feel that in the context of prepushing the second factor is likely to be more important: a good browser cache should absorb most of a user's repeated accesses, so it is the first-time access to a document by a user (i.e., a modem client) that is important to predict.

Second, the algorithms separate embedded objects from non-embedded objects, and treat a document and its embedded objects as a single entity. That is, if the algorithm decides to push a document to the client, it pushes the embedded objects as well. The reasons for doing so are twofold. First, embedded objects have an access probability of 1 following the access of the containing document, and finding them in practice is easy -- the proxy simply has to scan the document. Second, embedded objects are often fetched concurrently over four connections to the proxy, and the order in which those requests arrive at the proxy can be random; separating them out simplifies the data structures that keep track of the observed patterns.
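
A minimal sketch of this grouping, with hypothetical names: the predictor tracks only the containing document, and a decision to push that document expands into the document plus its recorded embedded objects (found, for example, by scanning the document for image references).

    embedded_objects = {}   # containing document URL -> URLs of its embedded objects

    def record_embedded(document_url, object_url):
        """Remember that object_url is embedded in document_url (e.g., found
        by scanning the document text for image references)."""
        embedded_objects.setdefault(document_url, []).append(object_url)

    def urls_to_push(predicted_document_url):
        """Expand a predicted document into the full set of URLs to prepush:
        the document itself plus everything embedded in it."""
        return [predicted_document_url] + embedded_objects.get(predicted_document_url, [])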

Since the trace does not indicate which URLs are embedded in other documents, we estimate this information by treating every ``.gif'' file that is referenced after one ``.html'' file and before the reference of the next ``.html'' file as an embedded object of the first ``.html'' file. This estimate errs on the side of overclassifying ``.gif'' files as embedded objects, and we are working on refining it.
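
The heuristic can be sketched as follows; it assumes that references holds one client's accesses in trace order, and the function name is ours.

    def classify_embedded(references):
        """Attribute every ".gif" seen between two ".html" references to the
        earlier ".html" file.  As noted above, this over-classifies."""
        embedded = {}            # .html URL -> .gif URLs assumed embedded in it
        current_html = None
        for url in references:
            if url.endswith('.html'):
                current_html = url
                embedded.setdefault(url, [])
            elif url.endswith('.gif') and current_html is not None:
                embedded[current_html].append(url)
        return embedded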

