Pachyderm: The Web Proxy that Never Forgets

Junfengy, Xin Li

{junfengy, lx}@cs.wisc.edu

Abstract

Web pages change every day: large amounts of information are updated, and some of it disappears. At the same time, disks are becoming ever cheaper and larger. We argue that it is both desirable and feasible to record a history of web pages using a specially designed proxy.
This paper describes the design, implementation, and performance of Pachyderm, a web proxy that automatically retains every version of each web page it retrieves from web servers, forming a history on disk that is accessible to users. Page versions are distinguished by their file names on disk. Storage is managed according to user-specified storage policies. History access uses glimpse as the search tool, and Pachyderm is built on top of Junkbuster, a simple filtering proxy.
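
As a rough illustration of versioning pages by file name, the sketch below stores each retrieved copy of a page under a name that encodes the URL and the retrieval time. The naming scheme, the helper names (version_path, store_version, list_versions), and the .html suffix are assumptions made for this example; the abstract does not specify the layout Pachyderm actually uses.

```python
import os
from urllib.parse import quote


def version_path(root, url, fetched_at):
    """Build a file name that encodes the URL and the retrieval time.

    Hypothetical scheme: one directory per URL, one file per version.
    """
    # quote(..., safe="") percent-encodes "/" and ":" so the whole URL
    # fits in a single path component. Very long URLs could exceed the
    # file system's name limit; this sketch does not handle that case.
    page_dir = os.path.join(root, quote(url, safe=""))
    # Zero-padded integer timestamps sort chronologically as strings.
    return os.path.join(page_dir, f"{fetched_at:010d}.html")


def store_version(root, url, body, fetched_at):
    """Write one retrieved copy of a page as a new version on disk."""
    path = version_path(root, url, fetched_at)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return path


def list_versions(root, url):
    """Return the stored version file names for a URL, oldest first."""
    page_dir = os.path.join(root, quote(url, safe=""))
    if not os.path.isdir(page_dir):
        return []
    return sorted(os.listdir(page_dir))
```

Under this assumed layout, a proxy would call store_version each time it fetches a page, and a history view would call list_versions to enumerate the retained copies.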



Whole paper available as: PDF or PS

Source code available here. A description of the files can be found here.

Here are some experimental results:

  • the comparison of the old design and the current design
  • the 10-day trace data and the resulting statistics
  • the history access time test data
  • the web polygraph test data
