I have Prof. Margo Seltzer's code here. What you need to do is to start from the existing code and change it until it works under FreeBSD or NetBSD. The hard part is the stability and robustness of the code. Getting it working won't be too hard; making sure that it doesn't core-dump the kernel once every two hours is the hard part.
Or, if you prefer, you can start with Prof. Seltzer's code and try to make it work under Linux.
Related papers: the "log-structured file system" paper in the reading material. Also, Prof. Margo Seltzer has many papers describing her implementation; you can find the list of papers here and the source code here.
Since then, the journaling file system (basically a file system that logs metadata changes) has become quite dominant. Naturally, people want to know whether a journaling file system has a different disk access pattern from the UNIX fast file system. Dr. Wilkes has agreed to provide disk I/O traces from a journaling file system, similar to the ones they used in their 1992 paper. What you need to do is to repeat the study on the new set of traces, and make a contribution to the research community by detailing the disk access characteristics of journaling file systems.
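To give a flavor of the kind of trace analysis involved, here is a minimal sketch. The trace format (a list of (op, block) tuples) and the sequential-gap threshold are assumptions for illustration, not the format of Dr. Wilkes's traces:

```python
from collections import Counter

def analyze(trace):
    """Summarize a disk I/O trace: read/write mix and seek distances.

    `trace` is a list of (op, block) tuples, where op is 'R' or 'W' and
    block is the starting disk block of the request (assumed format).
    """
    ops = Counter(op for op, _ in trace)
    # Seek distance = gap between consecutive request addresses; many
    # small gaps suggest sequential, log-like access, as one might
    # expect from a journaling file system's metadata log.
    seeks = [abs(b - a) for (_, a), (_, b) in zip(trace, trace[1:])]
    sequential = sum(1 for d in seeks if d <= 8) / len(seeks)
    return {"reads": ops["R"], "writes": ops["W"],
            "sequential_fraction": sequential}

# Example: a mostly sequential write burst with one far seek.
trace = [("W", 100), ("W", 104), ("W", 108), ("R", 5000), ("W", 112)]
stats = analyze(trace)
```

A real study would add distributions over request sizes, inter-arrival times, and read/write locality, but the replay loop has this shape.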
One solution to this problem is to store HTTP objects directly on disk, and
use a separate directory structure to find the object given its URI.
In this project, you will design and implement the directory structure,
which translates a URI to the disk location of the object. The directory
needs to meet the following performance criteria:
1. The translation must be fast: given a URI, it should quickly return a disk
block address or return -1 (meaning the object is not in the cache);
2. Inserting a new (URI, disk-location) pair and deleting an entry must be
fast;
3. The directory cannot assume that it will fit in memory all the time;
sometimes some parts of it have to be on disk. Thus, it must be designed
such that accesses to it have good temporal and spatial locality.
Fortunately, we also know a lot about access patterns to these objects. We know that there are a few very hot objects, many luke-warm objects, and a large number of cold objects. We know that most of the time when the directory is searched for a URI, the URI is cached. We also know that some objects tend to be accessed together.
Thus, your job is to design and implement the best directory structure based on these access characteristics and performance requirements! The directory does not have to handle the layout of the object on disk --- that is the job of a separate module and that module will call "insert" to enter new translations in the directory. However, if you have time and would like to work on this disk layout module as well, that is even better!
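To make the interface concrete, here is a toy sketch of the directory's API; the class and method names are hypothetical, and a real implementation would place hash buckets on disk pages (grouping related URIs together) to get the locality required above:

```python
class CacheDirectory:
    """A toy URI -> disk-block directory (hypothetical interface).

    This in-memory dict only demonstrates the interface; a real design
    must keep part of the structure on disk with good locality.
    """
    def __init__(self):
        self._map = {}

    def lookup(self, uri):
        # Return the disk block address, or -1 if the object is not cached.
        return self._map.get(uri, -1)

    def insert(self, uri, block):
        self._map[uri] = block

    def delete(self, uri):
        self._map.pop(uri, None)

d = CacheDirectory()
d.insert("http://example.com/a.html", 4096)
hit = d.lookup("http://example.com/a.html")   # returns 4096
miss = d.lookup("http://example.com/b.html")  # not cached: returns -1
```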
Related papers:
(1) on Web access patterns: here.
(2) on performance problems of existing Web caching software: here.
Fortunately, people have learned more about Web access patterns and users' surfing behavior. There are also many Web access traces publicly available for research purposes.
In this project, you will build a trace-driven simulator. The simulator takes as input a trace of Web accesses made by users in an organization, together with a description of the organization's internal network and its connection to the Internet, and simulates various cache locations and cache configurations. The simulator then recommends a design that optimizes measures such as Internet traffic and client latency.
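The core of such a simulator is replaying a trace against a cache model. A minimal single-cache sketch, assuming unit-size objects and LRU replacement (a real simulator would model object sizes, network topology, and latency):

```python
from collections import OrderedDict

def simulate_cache(trace, capacity):
    """Replay a Web-access trace through an LRU cache holding up to
    `capacity` objects, and report the hit ratio.
    """
    cache, hits = OrderedDict(), 0
    for uri in trace:
        if uri in cache:
            hits += 1
            cache.move_to_end(uri)         # refresh LRU position
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[uri] = True
    return hits / len(trace)

trace = ["a", "b", "a", "c", "a", "b"]
ratio = simulate_cache(trace, capacity=2)
```

Hit ratio here translates directly into Internet traffic saved, which is one of the measures the recommendation step would optimize.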
Related papers:
(1) on protocols that support collaboration of caches: here.
(2) on Web access patterns: here.
This leads to the idea of using DRAM to cache HTTP objects. However, in practice one would need to use a combination of DRAM and disks as the cache, because, after all, disks can be much larger than DRAM. The trick is in designing the right main-memory cache replacement algorithm and disk cache replacement algorithm so that the traffic to disk is as low as possible.
In this project, you will use trace-driven simulation to find the best main-memory replacement algorithms and the best disk cache replacement algorithms. Your simulator must model both caches and the overhead associated with disk I/Os. You must evaluate a variety of replacement algorithms; I will give you my guesses at good algorithms, but it is through simulating and analyzing the results that we will find the best ones. I will supply the Web access traces.
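As a starting point, here is a toy two-level simulator that counts disk I/Os under LRU at both levels. It assumes unit-size objects and charges one I/O per disk read or write; a real simulator must also model the per-I/O overhead:

```python
from collections import OrderedDict

def simulate_two_level(trace, mem_cap, disk_cap):
    """Two-level cache: an LRU DRAM cache in front of an LRU disk cache.
    A DRAM miss that hits on disk costs one disk read; an object evicted
    from DRAM is demoted to the disk cache at the cost of one disk write.
    """
    mem, disk = OrderedDict(), OrderedDict()
    disk_ios = 0
    for uri in trace:
        if uri in mem:
            mem.move_to_end(uri)        # DRAM hit: no disk traffic
            continue
        if uri in disk:
            del disk[uri]               # promote object to DRAM
            disk_ios += 1               # one disk read
        if len(mem) >= mem_cap:
            victim, _ = mem.popitem(last=False)
            if len(disk) >= disk_cap:
                disk.popitem(last=False)
            disk[victim] = True
            disk_ios += 1               # one disk write (demotion)
        mem[uri] = True
    return disk_ios

ios = simulate_two_level(["a", "b", "a", "c", "a"], mem_cap=1, disk_cap=2)
```

Swapping in different replacement policies at each level, and comparing the resulting `disk_ios`, is exactly the experiment the project calls for.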
Related papers:
(1) on Web access patterns: here.
(2) on good replacement policies: here.
One solution is to put a number of "rental" Web servers at strategic locations around the Internet, and when a hot spot occurs, the original Web server will ask the "rental" servers to act as mirror servers for it. When the hot spot is over, the rental servers stop acting as a mirror.
In this project, you will modify Apache, the most popular Web server software, to support this type of auto-mirroring. Specifically, the rental server should be able to accept requests for mirroring support and process user requests for the hot-spot server's documents. The original hot-spot Web server should make sure that all copies at the mirror servers are consistent with its own copies. You should also think about support for dynamic documents, including CGI scripts, etc.
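One simple way the origin server could keep mirrors consistent is by exchanging content digests. This sketch assumes a hypothetical protocol in which the mirror reports a digest per path and re-fetches whatever differs; all names here are illustrative:

```python
import hashlib

def needs_refresh(origin_docs, mirror_digests):
    """Return the paths whose mirror copy is stale: the origin compares
    a digest of each document body against the digest the mirror
    reports (hypothetical protocol; a real design must also handle
    paths the mirror has never seen).
    """
    stale = []
    for path, body in origin_docs.items():
        if mirror_digests.get(path) != hashlib.sha256(body).hexdigest():
            stale.append(path)
    return sorted(stale)

origin = {"/index.html": b"version 2", "/a.html": b"version 1"}
mirror = {"/index.html": hashlib.sha256(b"version 1").hexdigest(),
          "/a.html": hashlib.sha256(b"version 1").hexdigest()}
stale = needs_refresh(origin, mirror)
```

Dynamic documents are harder, since their output cannot be compared by digest; that is part of what makes the CGI question interesting.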
A year ago, we built the first benchmark for Web proxies, called the Wisconsin Proxy Benchmark (WPB) 1.0. More than a dozen proxy vendors and research groups have used our benchmark to test their products or investigate optimization techniques. Despite its success, WPB 1.0 has limitations: it does not model persistent connections, it uses a heavyweight process-based structure, and it does not model spatial locality.
Over the summer I started working on WPB 2.0. I have added support for persistent connections and switched the benchmark to an event-driven architecture. However, it does not yet model spatial locality, sessions, URI path lengths, document size distributions, or document latency distributions.
In this project, you will take the half-developed WPB 2.0, add models of all the characteristics listed above, use the new benchmark to measure a variety of proxy products, and test whether each of the characteristics affects a proxy's performance. I will provide all the infrastructure for the experiments.
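As an example of what "modeling the document size distribution" might look like, here is a sketch that draws sizes from a heavy-tailed Pareto distribution, a common model for Web document sizes. The parameter values are illustrative, not measured from any trace:

```python
import random

def sample_sizes(n, alpha=1.2, min_size=1024, seed=1):
    """Draw n document sizes (bytes) from a Pareto distribution, a
    common model for the heavy-tailed sizes seen in Web traces.
    alpha and min_size here are illustrative, not fitted, values.
    """
    rng = random.Random(seed)  # fixed seed: reproducible workloads
    return [int(min_size * rng.paretovariate(alpha)) for _ in range(n)]

sizes = sample_sizes(1000)
```

The other missing characteristics (sessions, path lengths, latency distributions) would be modeled the same way: pick a distribution justified by trace studies, then sample from it when generating requests.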
Related papers:
(1) Measuring Proxy Performance using the Wisconsin Proxy Benchmark 1.0.
(2) a paper on Web access patterns.
New:
(1) I recently wrote up a position paper on what the proxy benchmark should provide. It is here.
(2) The paper describing the httperf tool, on which WPB 2.0 is based, is here. The postscript of this paper is here.
(3) For those of you working on this project, send me email to get access to my half-developed code.
Similarly, there is interest in porting lmbench to Java Virtual Machines. Of course, the JVM does not support UNIX system calls, but it does support similar facilities such as threads, signals, etc. Therefore, it is interesting to rewrite part of lmbench in Java and test the relative performance of various JVMs.
The project does not require you to port all of lmbench to NT and/or the JVM. Rather, you need to study the Win32 API and Java, come up with a plan for how much of lmbench you can port, and then we will discuss the plan.
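To give a flavor of what an lmbench-style microbenchmark looks like (lmbench itself is written in C; this is only a toy illustration of the method), here is a minimal timing harness:

```python
import time

def bench(op, iters=100_000):
    """Time an operation lmbench-style: run it in a tight loop and
    report the average latency per call in microseconds. A real
    harness must also subtract loop overhead and repeat the run to
    control for variance.
    """
    start = time.perf_counter()
    for _ in range(iters):
        op()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6  # microseconds per operation

lat = bench(lambda: None)  # latency of a null call
```

Each lmbench test (null syscall, context switch, memory bandwidth) is a loop of this shape around a different `op`, which is what makes a partial port feasible.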
Recently, a new approach to controlling the resources used by Web requests, called session control, has emerged. The idea is that if the Web server is overloaded, then for some new requests it should return a note saying "I am too busy right now; try again in 10 seconds" and terminate the connection. It also sends a cookie to the browser, and the next time the same user sends Web requests, the server will admit the user. The idea is described in an HP Labs Technical Report, "Session Based Admission Control: A Mechanism for Improving the Performance of an Overloaded Web Server".
In this project, you will merge these two ideas, and use session-based admission control to provide differentiated levels of service. Specifically, you need to change Apache so that it can support different classes of Web pages, with each class allotted a specific percentage of the Web server's resources. Apache should then monitor the requests to these Web pages, and if the Web pages in some class receive too many requests, it should refuse to serve new requests for those pages until the server is lightly loaded.
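The admission decision itself can be sketched in a few lines. This toy model (hypothetical function names, no real HTTP or Apache internals) shows the per-request accept/refuse logic that the session-control scheme describes:

```python
def admit(active_sessions, max_sessions, cookie, admitted_cookies):
    """Toy session-based admission control: a request from an
    already-admitted session is always served; a new session is
    admitted only while the server has spare capacity, otherwise it
    is asked to retry later. Per-class resource limits would give
    each page class its own max_sessions.
    """
    if cookie in admitted_cookies:
        return "OK"
    if active_sessions < max_sessions:
        admitted_cookies.add(cookie)
        return "OK"
    return "503: too busy, retry in 10 seconds"

admitted = set()
r1 = admit(0, 1, "alice", admitted)  # idle server, new session: admitted
r2 = admit(1, 1, "bob", admitted)    # server at its limit: refused
r3 = admit(1, 1, "alice", admitted)  # admitted session: always served
```

The point of the session (cookie) check is that a user who has started interacting with the site is never cut off mid-session, even under load.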
One of the major formats for continuous media is Progressive Networks' RealAudio and RealVideo. Unfortunately, up until now there has been only one commercial implementation of a caching proxy for RealAudio and RealVideo. In this project, you will change this by providing an academic implementation, free to the research community.
You will study the protocol specification for RealAudio/RealVideo, inspect reference implementations of the protocol, and design and implement an intermediary proxy that is capable of caching multimedia documents. The project is challenging, but also very rewarding.