Abstract
Web pages change every day: large amounts of information are
updated, and some of it disappears. At the same time,
disks are becoming ever cheaper and larger. We argue that it is both desirable
and feasible to record a history of web pages using a specially designed
proxy.
This paper describes the design, implementation, and
performance of the Pachyderm system, a web proxy that automatically retains
every version of the web pages it retrieves from web servers, forming a
history on disk that it makes accessible to users. Page versions are
distinguished by their file names on disk. Storage in Pachyderm is managed by
user-specified storage policies. Access to the history is implemented with
glimpse as the search tool, and Pachyderm is built on top of Junkbuster, a
simple filtering proxy.
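As an illustration of versioning pages by file name, the following is a minimal sketch in Python and not Pachyderm's actual code. The naming scheme (a percent-encoded URL followed by a ".vN" suffix), the function names version_path, list_versions and store_page, and the choice to skip a write when the content matches the latest stored version are all assumptions made for this example; the abstract does not specify any of them.

```python
import os
import tempfile
from urllib.parse import quote


def version_path(root, url, version):
    # Hypothetical scheme: percent-encode the whole URL so it is a safe
    # file name, then append a numeric version suffix.
    return os.path.join(root, f"{quote(url, safe='')}.v{version}")


def list_versions(root, url):
    # Recover the version numbers stored for a URL from file names alone.
    prefix = quote(url, safe="") + ".v"
    versions = []
    for name in os.listdir(root):
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit():
            versions.append(int(suffix))
    return sorted(versions)


def store_page(root, url, body):
    # Save body as a new version and return its version number.
    # Assumption: an unchanged page does not create a new version.
    versions = list_versions(root, url)
    if versions:
        latest = versions[-1]
        with open(version_path(root, url, latest), "rb") as f:
            if f.read() == body:
                return latest
    new_version = versions[-1] + 1 if versions else 1
    with open(version_path(root, url, new_version), "wb") as f:
        f.write(body)
    return new_version


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    store_page(root, "http://example.com/", b"first retrieval")
    store_page(root, "http://example.com/", b"second retrieval")
    print(list_versions(root, "http://example.com/"))
```

Because the version number lives in the file name, the history of a page can be listed with a plain directory scan, and the stored files remain ordinary files that a text-search tool can index.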
Whole paper available as: PDF or PS
Source code available here.
A description of the files can be found here.
Here are some experimental results