Project Reports for CS736 Fall 1998

CS736 students in Fall 1998 have successfully completed the following projects. Below are short descriptions of the projects.


Log-Structured File System for FreeBSD or Linux

Even though the log-structured file system was proposed in 1990, there still isn't a log-structured file system running on the freeware operating systems (FreeBSD, Linux, etc.). Prof. Margo Seltzer at Harvard University implemented a prototype LFS for an old version of BSDi, but the prototype doesn't work on any of the free BSD versions. There have been talks about implementing LFS for Linux for years, but not much progress has been made.

I have Prof. Margo Seltzer's code here. What you need to do is to start from the existing code and change it until it works under FreeBSD or NetBSD. The hard part is the stability and robustness of the code. Getting it working won't be too hard; making sure that it doesn't core-dump the kernel once every two hours is the hard part.

Or, if you prefer, you can start with Prof. Seltzer's code and try to make it work under Linux.

Related papers: the "log-structured file system" paper in the reading material. Also, Prof. Margo Seltzer has many papers describing her implementation; you can find the list of papers here and the source code here.


Disk Access Characteristics of Journaling File Systems

In 1992, Dr. John Wilkes and his student at HP Labs studied the disk access patterns of UNIX file systems. Their discoveries have significantly influenced the design and optimization of disk drives. The paper can be found here.

Since then, the journaling file system (basically a file system that logs metadata changes) has become quite dominant. Naturally, people want to know whether the journaling file system has a different disk access pattern from the UNIX fast file systems. Dr. Wilkes has agreed to provide disk I/O traces from a journaling file system, similar to those used in their 1992 paper. What you need to do is repeat the study on the new set of traces, and make a contribution to the research community by detailing the disk access characteristics of journaling file systems.
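To make the task concrete, here is a minimal Python sketch of the kind of analysis involved. The trace format (a list of (start-sector, read/write) pairs) is a simplified stand-in for the fields in the real traces, which carry timestamps, request sizes, and more:

```python
from collections import Counter

def seek_distance_histogram(trace):
    """Bucket the absolute seek distances between consecutive requests
    into power-of-two bins, and report the fraction of writes.
    `trace` is a list of (start_sector, op) pairs, op in {'R', 'W'}."""
    hist = Counter()
    writes = 0
    prev = None
    for sector, op in trace:
        if op == 'W':
            writes += 1
        if prev is not None:
            dist = abs(sector - prev)
            hist[dist.bit_length()] += 1   # bin index = power-of-two bucket
        prev = sector
    frac_writes = writes / len(trace) if trace else 0.0
    return hist, frac_writes
```

The real study would add much more (burstiness over time, request size distributions, queue depths), but the flavor is the same: replay the trace and summarize.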


Directories and Disk Layout for HTTP objects

Most Web caches today use the file system to store HTTP objects (i.e. Web pages). Each object is named by its URI (Uniform Resource Identifier) and stored as a file. The translation from the URI to the file is done either by direct path translation (for example, http://www.cs.wisc.edu/index.html translates to http/www.cs.wisc.edu/index.html) or through a separate hash table. However, this approach has poor performance. The reason is that in most file systems, file creation and deletion lead to synchronous disk I/Os. Furthermore, file directories are typically implemented as a linear list of file entries, making opening a file an expensive operation in large directories.

One solution to this problem is to store HTTP objects directly on disk, and use a separate directory structure to find the object given its URI. In this project, you will design and implement the directory structure, which translates a URI to the disk location of the object. The directory needs to meet the following performance criteria:
1. The translation must be fast: given a URI, it should quickly return a disk block address or return -1 (meaning the object is not in the cache);
2. Inserting a new (URI, disk-location) pair and deleting an entry must be fast;
3. The directory cannot assume that it will fit in memory all the time; sometimes parts of it have to reside on disk. Thus, it must be designed such that accesses to it have good temporal and spatial locality.

Fortunately, we also know a lot about access patterns to these objects. We know that there are very hot objects, many luke-warm objects, and a large number of cold objects. We know that most of the time when the directory is searched for a URI, the URI is cached. We also know that some objects tend to be accessed together.

Thus, your job is to design and implement the best directory structure based on these access characteristics and performance requirements! The directory does not have to handle the layout of the object on disk --- that is the job of a separate module and that module will call "insert" to enter new translations in the directory. However, if you have time and would like to work on this disk layout module as well, that is even better!
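As a rough illustration of the interface (the class name, the bucket count, and the hashing scheme below are invented for this sketch, not part of the project), a simple hash-based directory in Python might look like this; a serious design would pack each bucket into one disk block to get the required locality:

```python
import hashlib

class CacheDirectory:
    """Hypothetical sketch: maps a URI to a disk block address.
    Buckets stand in for disk-block-sized groups of entries."""
    def __init__(self, nbuckets=1024):
        self.buckets = [dict() for _ in range(nbuckets)]

    def _bucket(self, uri):
        h = int(hashlib.md5(uri.encode()).hexdigest(), 16)
        return self.buckets[h % len(self.buckets)]

    def lookup(self, uri):
        """Return the disk block address, or -1 if not cached."""
        return self._bucket(uri).get(uri, -1)

    def insert(self, uri, block):
        self._bucket(uri)[uri] = block

    def delete(self, uri):
        self._bucket(uri).pop(uri, None)
```

Note that plain hashing destroys the "objects accessed together" locality; part of the project is deciding how to do better than this baseline.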

Related papers:
(1) on Web access patterns: click here.
(2) on performance problems of existing Web caching software, click here.


Hierarchy Design for Web Caching

Many organizations are interested in setting up Web caches these days to cut down Internet traffic. However, for large organizations that have offices around the country or even the globe, where to put the caches and how to coordinate them are unsolved problems.

Fortunately, people have learned more about Web access patterns and users' surfing behavior. There are also a lot of Web access traces publicly available for research purposes.

In this project, you will build a trace-driven simulator. The simulator takes as input a trace of Web accesses made by users in an organization and a description of the internal network of the organization as well as its connection to the Internet, and simulates various cache locations and cache configurations. The simulator then recommends a design that optimizes measures such as Internet traffic and client latency.
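To give a feel for the simulator's core, here is a simplified Python sketch that replays a trace against one candidate configuration: one LRU cache per office plus a shared root cache near the Internet link. The trace format and names are invented for illustration; the real simulator would also model network topology and latency:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache holding whole objects, one slot each."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert key (evicting the LRU)."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False

def simulate(trace, leaf_caches, root_cache):
    """trace: list of (office_id, url). Returns the number of requests
    that missed at every level and went out to the Internet."""
    internet = 0
    for office, url in trace:
        if leaf_caches[office].access(url):
            continue                       # served by the office cache
        if root_cache.access(url):
            continue                       # served by the shared cache
        internet += 1                      # had to cross the Internet link
    return internet
```

Running this over many candidate placements and cache sizes, and picking the one with the least Internet traffic, is the essence of the recommendation step.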

Related papers:
(1) on protocols that support collaboration of caches, click here.
(2) on Web access patterns: click here.


Replacement Algorithms in DRAM/Disk Caching of HTTP objects

Using caches to improve Web performance has proved to be a good idea for the people who pay the bills for Internet connections. Alas, it has not proved to be a good idea for the users who surf the Web. The reason: most Web caches today use disks to cache HTTP objects, and if you think the Internet is slow, the disk is just as slow.

This leads to the idea of using DRAM to cache HTTP objects. However, in practice one would need to use a combination of DRAM and disks as the cache, because after all, disks can be much larger than DRAMs. The trick is in designing the right main-memory cache replacement algorithm and the disk cache replacement algorithm so that the traffic to disk is as low as possible.

In this project, you will use trace-driven simulation to find what the best main-memory replacement algorithms and the best disk cache replacement algorithms are. Your simulator must model both caches and the overhead associated with disk I/Os. You must evaluate a variety of replacement algorithms; I will give you my guesses of good algorithms, but it is through simulating and analyzing the results that we will find the best ones. I will supply the Web access traces.
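Here is a toy Python sketch of the kind of two-level simulation involved. It models an LRU DRAM cache backed by an LRU disk cache, with every object occupying one slot (a deliberate simplification; the real simulator must model object sizes and I/O overheads), and counts the disk traffic:

```python
from collections import OrderedDict

def simulate(trace, dram_slots, disk_slots):
    """Replay a URL trace through an LRU DRAM cache backed by an LRU
    disk cache. Returns (disk_reads, disk_writes, internet_fetches)."""
    dram, disk = OrderedDict(), OrderedDict()
    reads = writes = net_fetches = 0
    for url in trace:
        if url in dram:
            dram.move_to_end(url)          # DRAM hit, no disk I/O
            continue
        if url in disk:
            disk.move_to_end(url)
            reads += 1                     # disk hit: read into DRAM
        else:
            net_fetches += 1               # full miss: fetch from origin
        dram[url] = True
        if len(dram) > dram_slots:
            victim, _ = dram.popitem(last=False)
            disk[victim] = True            # demote the DRAM victim to disk
            writes += 1
            if len(disk) > disk_slots:
                disk.popitem(last=False)   # disk cache evicts its LRU
    return reads, writes, net_fetches
```

Swapping in other policies for either level (LFU, size-aware, frequency-biased) and comparing the disk I/O counts is exactly the experiment the project calls for.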

Related papers:
(1) on Web access patterns: click here.
(2) on good replacement policies: click here.


Auto-Mirroring Support in Apache

The hot-spots on the Internet are hard to predict --- who would have thought of the Starr report a month ago? But when such hot spots happen, they do a lot of damage to the Internet, overloading the routers and blocking accesses for a lot of people.

One solution is to put a number of "rental" Web servers at strategic locations around the Internet, and when a hot spot occurs, the original Web server will ask the "rental" servers to act as mirror servers for it. When the hot spot is over, the rental servers stop acting as a mirror.

In this project, you will modify Apache, the most popular Web server software, to support this type of auto-mirroring. Specifically, the rental server should be able to accept requests for mirroring support and process user requests for the documents of the hot-spot server. The original hot-spot Web server should make sure that all copies at the mirror servers are consistent with its own copies. You should also think about support for dynamic documents including cgi-scripts, etc.
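As a sketch of one possible policy (the class name, the threshold scheme, and the mirror hostnames below are all hypothetical), the origin server could shed load by redirecting excess requests to the rental mirrors round-robin with HTTP 302 responses:

```python
import itertools

class MirrorRedirector:
    """Hypothetical origin-side policy: once the request rate in the
    current interval crosses a threshold, redirect further requests
    to rental mirrors round-robin."""
    def __init__(self, mirrors, threshold):
        self.mirrors = itertools.cycle(mirrors)
        self.threshold = threshold
        self.rate = 0                      # requests seen this interval

    def handle(self, path):
        self.rate += 1
        if self.rate > self.threshold:
            host = next(self.mirrors)
            return 302, "http://%s%s" % (host, path)   # redirect to mirror
        return 200, path                               # serve locally

    def end_interval(self):
        self.rate = 0
```

In the actual project this logic would live inside an Apache module, and the hard parts (recruiting mirrors, pushing updated copies to keep them consistent) happen around it.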


Wisconsin Proxy Benchmark 2.0

Benchmarks are very important. They push vendors to improve their products and benefit customers. It is also very important to construct good benchmarks, which reflect how a system is used in practice and stress the tested system in ways similar to real-life situations.

A year ago, we built the first benchmark for Web proxies, called the Wisconsin Proxy Benchmark (WPB) 1.0. More than a dozen proxy vendors and research groups have used our benchmark to test their products or investigate optimization techniques. Despite its success, there are limitations to WPB 1.0, including that it does not model persistent connections, it uses a heavyweight process-based structure, and it does not model spatial locality.

Over the summer I started working on WPB 2.0. I have added support for persistent connections and switched the benchmark to an event-driven architecture. However, it does not yet model spatial locality, sessions, URI path length, document size distributions, or document latency distributions.

In this project, you will take the half-developed WPB 2.0, add models of all the characteristics listed above, use the new benchmark to measure a variety of proxy products, and test whether each of the characteristics affects the proxy's performance. I will provide all the infrastructure for the experiments.
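For example, document popularity in real proxy traces is roughly Zipf-like, and a workload model could sample document ids accordingly. A Python sketch (the parameter values are placeholders; the real benchmark would fit them to actual traces):

```python
import random

def zipf_request_stream(n_docs, n_reqs, alpha=0.8, seed=0):
    """Generate a synthetic request stream whose document popularity
    follows a Zipf-like distribution: document rank r gets weight
    1 / r**alpha. Returns a list of document ids (0 = most popular)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_docs + 1)]
    total = sum(weights)
    cum, acc = [], 0.0
    for w in weights:
        acc += w / total
        cum.append(acc)                    # cumulative distribution
    stream = []
    for _ in range(n_reqs):
        u = rng.random()
        lo, hi = 0, n_docs - 1             # binary search the CDF
        while lo < hi:
            mid = (lo + hi) // 2
            if cum[mid] < u:
                lo = mid + 1
            else:
                hi = mid
        stream.append(lo)
    return stream
```

Document sizes, latencies, and session structure would each get a similar sampling model, all driven from the same generator so the characteristics can be switched on and off independently.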

Related papers:
(1) Measuring Proxy Performance using Wisconsin Proxy Benchmark 1.0.
(2) a paper on Web access patterns.

New:
(1) I recently wrote up a position paper on what the proxy benchmark should provide. It is here.
(2) The paper describing the httperf tool, which WPB 2.0 is based on, is here. The postscript of this paper is here.
(3) For those of you who are working on this project, send me email to get access to my half-developed code.


Porting lmbench to Windows NT and/or Java JVM

Well, you have run and studied lmbench, the micro benchmark for UNIX operating systems. There is a lot of interest in porting lmbench to Windows NT these days, because people want to know and understand performance under Windows NT. Of course, the tricky part is not in recompiling lmbench under the POSIX subsystem in Windows NT, but rather in rewriting lmbench to use the native Win32 APIs (after all, those are the ones that are really well supported on Windows NT).

Similarly, there is interest in porting lmbench to Java Virtual Machines. Of course, the JVM doesn't support the UNIX system calls, but it does support similar facilities such as threads, signals, etc. Therefore, it is interesting to rewrite part of lmbench in Java and test the relative performance of various JVMs.

The project does not require you to port all of lmbench to NT and/or the JVM. Rather, you need to study the Win32 API and Java, come up with a plan of how much of lmbench you can port, and then we will discuss the plan.
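Whatever subset you port, the measurement methodology stays the same: time a tight loop around the operation and subtract the cost of an empty loop. Here is a Python sketch of that lmbench-style harness (illustrative only; the real port would use Win32 or Java timing primitives):

```python
import time

def measure_latency(op, iters=100000):
    """lmbench-style measurement: time `iters` calls of `op` in a
    tight loop, subtract the cost of an empty loop, and return the
    per-call latency in seconds (clamped at zero)."""
    def timed(f):
        start = time.perf_counter()
        for _ in range(iters):
            f()
        return (time.perf_counter() - start) / iters
    empty = timed(lambda: None)            # loop + call overhead
    return max(timed(op) - empty, 0.0)
```

The plan you write up is largely a matter of deciding which lmbench operations (system calls, context switches, memory bandwidth, etc.) have meaningful equivalents in Win32 or the JVM for this harness to wrap.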


Session-based Differentiated QoS in Web Content Hosting

Web content hosting, in which a Web server stores and provides Web access to documents for different customers, is becoming increasingly common. Due to the variety of customers (corporate, individuals, etc.), providing differentiated levels of service is often an important issue for the hosts. Last year, three 736 students modified Apache to introduce the concept of service levels, that is, requests to some Web pages receive higher priority than requests to other pages. Their report is here.

Recently, there is a new approach to controlling the resources used by Web requests, called session control. The idea is that if the Web server is overloaded, then for some new requests it should return a note saying "I am too busy right now; try again after 10 seconds" and terminate the connection. It will also send a cookie to the browser, and the next time the same user sends a Web request, the server will admit the user. The idea is described in an HP Labs Technical Report, "Session Based Admission Control: A Mechanism for Improving the Performance of an Overloaded Web Server".

In this project, you will merge these two ideas and use session-based admission control to provide differentiated levels of service. Specifically, you need to change Apache so that it can support different classes of Web pages, each class occupying a specific percentage of the Web server's resources. Apache should then monitor the requests to these Web pages, and if Web pages in some class receive too many requests, it will refuse to serve new requests to those pages until the server is lightly loaded.
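As a sketch of the bookkeeping involved (the class names, shares, and per-interval accounting below are invented for illustration), the per-class admission decision might look like:

```python
class ClassAdmission:
    """Hypothetical per-class admission control: each page class gets
    a share of server capacity per interval; once a class exceeds its
    share, new requests in that class are refused (the real server
    would respond with a "too busy" page plus a retry cookie)."""
    def __init__(self, capacity, shares):
        # shares: {class_name: fraction of total capacity}
        self.limits = {c: capacity * f for c, f in shares.items()}
        self.used = {c: 0 for c in shares}

    def admit(self, cls):
        if self.used[cls] >= self.limits[cls]:
            return False                   # over budget: refuse
        self.used[cls] += 1
        return True

    def end_interval(self):
        for c in self.used:
            self.used[c] = 0               # fresh budget each interval
```

Inside Apache this check would run per request, with the cookie mechanism from the HP report deciding which refused users get priority re-admission later.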


Caching Support for Continuous Media on the Internet

In the near future, a significant part of the Internet bandwidth could very well be consumed by continuous media (audio and video). Since these multimedia files almost never change, caching them could yield big savings.

One of the major formats of continuous media is Progressive Networks' RealAudio and RealVideo. Unfortunately, up till now there is only one commercial implementation of caching proxies for RealAudio and RealVideo. In this project, you will change this by providing an academic implementation free to the research community.

You will study the protocol specification for RealAudio/RealVideo, inspect reference implementations of the protocol, and design and implement an intermediary proxy that is capable of caching multimedia documents. The project is challenging, but it is also very rewarding.