CS 736 Project Ideas

CS 736
Spring 1997
Term Project Ideas

WARNING: Under Construction

Marvin Solomon
solomon@cs.wisc.edu

Last updated: Sat Feb 14 10:32:26 CST 1998


Contents


General Comments

The projects are intended to give you an opportunity to study a particular area related to operating systems. Some of the projects will require test implementations, measurement studies, simulations, or some combination. Most will require a literature search before you begin.

The project suggestions below are briefly stated. They are intended to guide you into particular areas, and you are expected to expand these suggestions into a full project descriptions. This gives you more freedom in selecting an area and more burden in defining your own project. There may be more issues listed for a project than you can cover. I would prefer you to think up a topic that is not listed below. If you do, you might want to come and talk with me so we can work out a reasonable project description.

Some projects are appropriate for groups of two or more. There is no upper bound on the size of the group, but beware that groups of more than 3 persons are very hard to manage. I will not get involved in labor disputes; you will all hang together or you will be all hanged separately. Feel free to ask me for my opinions whether the size of a proposed team is appropriate for a given project.

In some cases, a project area straddles the boundary between operating systems and some other area (such as database, architecture, artificial intelligence, or programming languages). Such projects are intended for students with background and interests in the second area to explore the interface. They are not intended as substitutes for the regular courses in the second area. For example, if you have little or no background in database, you should take CS 764 before choosing a topic that requires database sophistication. Most topics call for a careful literature search before the proposal due date.

Due Dates

Project Proposal Due
Tuesday, March 3
Final Report Due
Thursday, May 7

Project Proposal

You are to hand in an expanded description of your project (see the due date above; you are free to turn it in sooner). The project proposal should be brief (two pages or less), but very specific about what you intend to do. It must be long enough to convince me that you have a reasonable and achievable project (i.e, not trivial and not too large). You may have to revise your description before it will be acceptable.

The project description should describe the problem that you are addressing, how you will go about working on this project (the steps you will take), what results you expect and what you expect to produce, what resources you will need, and a brief schedule for your project. It must be reasonably well written. Projects that involve implementation or simulation should indicate what resources are required.

You should make an ordered list of features together with your current best guess on what you intend to accomplish together with contingency plans in case of unforeseen problems (``I will definitely do (a) and then (b). If a.3 turns out to be impractical, however, I will do (b') instead. If time allows, I will also do (c). If things go especially well, I would also like to try (d), (e), and (f), in that order'').

You should have already done a substantial amount of background work on the project before writing the proposal. For example, if you intend to do a simulation, you should have familiarized yourself with all available software tools, and decided which are most appropriate. If you intend to build a useful tool such as a threads package or a distributed make tool, you should know about all such tools that have been built before and described in the open literature. There is no reason why you shouldn't do something that has been done before. After all, the main purpose of this project is learning. But if it has been done before, you should learn about previous attempts so that you can learn from their mistakes rather than simply repeating them yourselves.

I will review all proposals and offer comments. Sketchy proposals will get sketchy comments. I will also indicate my opinion of the quality of each proposal.

The Project Report

At the end of the semester, you will hand in a project report. The length will vary depending on the type of project, but no paper should be over 15 pages unless you get specific prior approval for a longer report. (A famous person once wrote in a letter, ``Please excuse the length of this letter. I did not have the time to make it shorter.'') In all cases, the quality of the writing will be a factor in the grade. You will also make a short oral presentation to the class and, if appropriate, demonstrate your software.

Peer Reviewing

Your project report should be read and reviewed by at least one other person in the class. It is up to you to select the person. This person will critique your paper and you will use the critique to revise your paper.

Project Suggestions

Java:

Java is currently very trendy. All sorts of Java-related projects are possible.

Naming in Large Computer Networks:

Consider the naming of resources (e.g., mail address, servers, etc.) in a distributed environment with many (1000 or more) computers. This environment might include universities, companies, and government agencies. Each of these areas might include other environments (e.g., the university might include a CS department, computer center, ECE department, etc.). A name service for such an environment is a special-purpose distributed database. A server can register services. Each registration includes a description of the service provided (e.g., ``mail delivery'') and information necessary to use the service (e.g., ``connect to address [128.105.2.33], port 25''). A client can look for a specific service (e.g., ``How do I deliver mail to host gjetost.cs.wisc.edu?'') or make a more generic request (e.g., ``Find me a nearby printer that allows student access and supports PostScript''.

Design a name service for such an environment. Issues such as performance, local autonomy, scope of authority, reliability, protection, and expendability may be discussed. How are names used? (What studies might you do to find out?) What are the limits on the size of the environment that your design will be able to support? Evaluate your design through a pilot implementation or a simulation.

For background, read about Grapevine, Clearinghouse, the Arpanet Domain Name Service (see me for more specific references).

Group Communication:

Several researchers have developed protocols and software packages to facilitate communication among processes in a distributed program. A process supplies information by delivering messages to the system and consumes it by registering requests. The system forwards each message to processes that expressed interest in it. Details differ considerably among the various proposals. Examples include the FIELD system from Brown university, the ISIS system from Cornell, and the Linda language from the University of Maryland. Numerous other proposals may be seen as variations on this theme, including other past and proposed 736 projects such as DREGS, Condor, and the switchboard. Among the dimensions of variability are
Implementation.
Some systems are implemented by a central server. Others are fully distributed, using broadcasts of messages and/or requests. Other possibilities include establishment of explicit routing trees, or using a central server only to introduce processes to one another and allowing them to engage in bilateral or multilateral communication thereafter.
Reliability and Security.
Some systems go to great lengths to cope with process and network failures, authentication and security, or out-of-order delivery, while others largely ignore these problems.
Matching and Synchronization.
Systems differ in criteria for matching messages with requests. The simplest approach is to require an exact match: Each message has a ``message type'' and each request specifies an interest in all messages of that type. Other schemes involve regular-expression string matching, general Boolean expressions, or Prolog-like unification. A related issue is whether a message is delivered to a single process (perhaps with some priority ordering), multicast to all who are interested, or saved for those who may request it in the future. Requests can be blocking or non-blocking.
Data Types.
Messages may be simple untyped byte strings or they may be typed structures. The system may provide type checking facilities, to make sure the receiver interprets the data as the sender intended, and it may even provide automatic data conversion among integer or floating-point representations, character sets, etc. A concrete example is Linda, which maintains a single, conceptually global tuple space . Linda provides the primitives put which adds a tuple to tuple space, get which waits for a tuple with a give first component to appear and then removes it from the space, and read which waits for a matching tuple, but does not remove it from the space.

Security Audit:

A properly managed computer system should be secure from illegal entry. Normal users should not be able to obtain privileges beyond what they are given. Most systems in everyday have security holes. Normally, it is considered a violation of standards of ethical behavior to take advantage of such holes. However, a ``tiger team'' is a team specifically authorized to find as many security holes as possible and report them to responsible management.

Select a facility in the Computer Sciences Department or elsewhere and find, demonstrate, and document as many security problems as possible. You may attack the system either from the position of an ``ordinary'' user, with an account but no special privileges, or from the point of view of an outsider--someone who is not supposed to be able to access the facility at all. You should find as many security problems as possible. These problems include system flaws, improper management, and careless users. The results of this study should be a report of the problems, with suggestions for fixes in the system, system design, and changes in management procedures. You should not explore ``denial of service'' attacks such as jamming networks or crashing systems. Warning: A project of this kind must be approved in advance by the person responsible for the facility you are proposing to attack!

File Servers for Workstations:

Workstations are available with and without local disks. Bulk storage is provided by a combination of remote file servers, local disk, and local RAM memory. Servers provide remote devices, remote files, or other abstractions. A variety of schemes for providing a ``seamless'' global file service have been suggested, including remote disk simulation, remote file access (e.g. NFS from Sun Microsystems) whole-file caching on local disk as in the Carnegie-Mellon ITC system (Andrew file system) and use of large local RAM for file caching, as in the Sprite system from Berkeley. The Locus system should also be studied for ideas about transparent global file naming.

Design a scheme for file access for a network of workstations. You should specify the functionality that is provided by the server and the responsibility of the client workstation. You will want to discuss reliability, fault tolerance, protection, and performance. Compare your design to the designs published in the literature. Evaluate the design by performing a simulation. See the ``Spritely NFS'' paper by Srinivasan and Mogul and the award-winning paper by Shirriff and Ousterhout from the Winter 1992 USENIX (see me for a copy) for examples of similar studies. See also related papers in SOSP and OSDI proceedings over the last several years.

Load Balancing:

Many systems such as LOCUS, Sprite, or Condor allows you to start processes on any machine, move processes during execution, and access files (transparently) across machine boundaries. Automatic placement of processes and other system resources could substantially improve overall system performance. There are several interesting issues in load balancing, including
  1. Collection of Data for Load Balancing: To make a load balancing decision, you might need data from each machine in the network. There are many forms that this data can take, and many designs for communicating this among machines. You must decide what data is needed, from where the data must come, and how it must be communicated.

    This problem becomes interesting in the scope of a very large network of computers (1000's of machines). You do not want to consume huge amounts of system resources making these decisions, and you do not want to make decisions based on extremely old data.

  2. Policies for Load Balancing Decisions: How do you decided when to move a process? On what do you base your decision? How frequently can we move processes (what is thrashing like in this environment)? What about groups of processes that are cooperating?
  3. Metrics for Load Evaluation: What load metrics do you use in evaluating an individual machine's capacity? Are these related to processing? storage? communication? How do we (can we) measure these? Are they accurate reflections of a machine's performance? How can you demonstrate this?
  4. File migration: We can move files, as well as processes. When do you move files vs. processes? Is only one needed? Which is better? How can you tell?

You are warned that is quite easy to suggest many plausible schemes for load balancing but not so easy to evaluate them. Therefore, a major component of any project in this area will be evaluation through simulation.

Security and Authentication:

The Popek and Kline paper on the reading list discusses use of encryption for authentication in distributed systems. It considers both conventional and public-key schemes. One popular implementation based on these ideas is the Kerberos system from MIT. Kerberos has been used to provide secure remote login, file transfer, and remote file access.

Use Kerberos or an ad hoc package to enhance the security of some existing system.

Random Software testing:

This suggestion is from Bart Miller.

The goal of this project is to test the robustness of Windows/NT applications. Several years ago, we built tools to generate and test programs by feeding them random input. The result of this study was that we were able to crash 25-33% of the standard UNIX utilities. Almost every UNIX manufacturer adopted our Fuzz testing tools as part of their release process. In 1995, we repeated and expanded these tests on more platforms and included X-window applications. The goal of this semester's project is to test OS kernels in a similar way to that we used to test application programs.

The basic tool is called the fuzz generator. This is a program that generates a random character stream. We used the fuzz generator to attack as many UNIX utilities as possible, with the goal of trying to break them. For the utilities that broke, we determined the the cause of the break. There is also a tool called ptyjig that allows random input to be fed to interactive programs.

The project is to test Windows applications in a similar manner to our 1995 testing of X-Window applications.

Navigating the World-Wide Web

The World-Wide Web is growing at an unbelievable pace. There's a tremendous amount of information available, but finding what you want can be next to impossible. Quite a few on-line search engines have been created to aid in resource location on the web. Click the Net Search button in the tool bar of NetScape for some examples. (Of particular note is WebCrawler, written by Wisconsin alumnus Brian Pinkerton, who sold it to America Online, reputedly for over $1 million!)

There are lots of ways of tackling this problem, but none discovered thus far is entirely satisfactory. Among the variables in design space are

Server Support
Does the provider of information cooperate in advertising it, or is the search entirely client-driven?
Caching
Does each search start from scratch, or is some sort of ``database'' used to guide the search? In the latter case, where is the database kept (at the client, the server, or somewhere in between)? How is it created? How is stale information detected and updated? How is the cache purged of valid, but seldom-referenced information?
Search Strategy
How does the search determine which information will be of interest to the user? How does determine which links to traverse, and in what order? When does it know when it has gone far enough?

Topology of the Web

A project closely related to the previous suggestion is to collect and analyze information about the current structure of the web. The web can be viewed as a vast directed graph. Gather as much information you can about this graph an analyze it. What is the average number of links out of a page? What is the average size of a page. What is the average distance between the pages at the two ends of a link (where ``distance'' is the number of links along a shortest path)? More generally, what are the distributions of these statistics? How do these things vary over time?

Information from this project would be of great interest to people proposing algorithms for traversing the web. This project has two distinct parts, both potentially quite challenging: gathering the data and analyzing it.

A General-Purpose Transaction Package:

The concept of a transaction--a sequence of actions that are executed atomicly and either commit (are reliably preserved forever) or abort (are completely undone)--was developed in the context of database systems, but transactions are useful in many areas outside of traditional database applications. Design and implement a portable transaction package. Look at Camelot , developed in the context of Mach, and libtb , built by Margo Seltzer and described in a recent Usenix proceedings.

Distributed Shared Memory:

There been a great deal of interest recently in an architecture called ``distributed shared memory''. The basic idea is to simulate a shared-memory multiprocessor programming model on top of a distributed system (a local-area network) by altering the page-fault handler of a traditional operating system to fetch pages over the network rather than the local disk. The 1991 SOSP contains a paper on an operating system called Munin , which explores some of the tradeoffs in page placement and replacement policies to support a variety of applications efficiently. Explore these issues by constructing a simulation.

Performance Study:

Monitor one or more of the Computer Science Department's machines or networks to determine its characteristics. Where are the bottlenecks? What sorts of programs are producing most of the load. What causes spikes in usage (and corresponding drops in response)?

For example, in a recent USENIX conference Matt Blaze describes a publicly available program for eavesdropping on NFS traffic on a local area Ethernet and gathering statistics. Install this program, use it to gather some statistics, and compare them with similar data from the literature. See also the suggestions regarding distributed file systems above.

Distributed or Persistent Garbage:

The problem of garbage collection (finding and reclaiming space allocated to inaccessible objects) has been well studied for almost 30 years. Algorithms can be roughly classified as explicit deletion (``It's my data and I'll throw it away when I want to!''), reference counting (``Will the last one out please turn off the lights?''), mark-and-sweep (``Unclaimed goods will be recycled''), and generational (``When the ashtray is full it's time to buy a new car''). Recently, there's been a resurgence of research in garbage collection spurred by two developments: distributed systems (``I can't throw this away because somebody in France may still want it'') and persistent programming languages (the Pampers problem: The only thing worse than garbage is persistent garbage). Well known garbage collection algorithms that work fine for physical or virtual memory are terrible when pointers can cross continents or disk cylinders. Interesting algorithms for a disk-based or distributed environment have been proposed (see me for references). Study some of these algorithms, and either suggest improvements or implement them and study their performance.

Consumer Reports:

Many people are generating software and making it freely available on the network for ``anonymous ftp.'' Often, there are several packages available for the same or similar purposes. Much of this software is worth exactly what it costs, but some of it is as good as, if not better than, expensive ``commercial'' products. Select two or more related programs and do a careful comparative critical review. Depending on the nature of programs, the review might be a benchmarking study of relative performance, an analysis of functionality or ease-of-use, or some combination of these factors.

One area of particular interest is file accessing and indexing packages (if this were cs764, I would call them low-level database facilities). Examples are the WISS and Exodus storage managers, both written here, and the dbm and libdb packages from Berkeley (the latter is in the yet-to-be-released 4.4BSD version of Unix, but we have a early version of this code).

A related suggestion is to compare implementations of Unix and alternative ways of achieving the same function in different ways. For example, consider the question, ``What's the best way to get data from one process to another?'' Under various flavors of Unix you can use TCP or UDP, Unix-domain sockets, pipes, fifo's, shared memory, files, or at least three different flavors of remote procedure call. The answer depends on the versions of Unix involved, and various characteristics of the communication desired (such as the amount of data to be transferred, the sizes of messages, whether replies are required, the degree of reliability needed, etc.) I've written a rough program that tests many of these techniques. I would like someone to polish the program a bit and use it to do an evaluation of many of the IPC mechanisms available.

Condor:

Condor is a locally-written utility that makes unused cpu power on idle workstations available for productive use. A daemon process on each workstation monitors activity and reports to a central resource manager. A client who wishes to run a long cpu-bound program contacts the resource manager to obtain the name of an idle workstation. It then contacts the selected server workstation and sends the job to be executed. Jobs to be run under Condor may be linked with a version of the C library that handles system calls specially: File I/O calls are turned into requests sent back to a shadow process running on the submitting host. If the server workstation should become non-idle before the job finishes, the job is checkpointed and restarted on another workstation in the pool. One user of Condor had a program successfully complete after consuming over 300 cpu days during a period that spanned the department's move to a new building!

For more information about Condor, see the Condor home page. If you are interested in a project related to Condor, talk to me and/or Miron Livny.

Specialized NFS Servers:

The Unix File System interface provides a convenient abstraction for a variety of data beyond ordinary files. For example, ``classic'' Unix makes I/O devices and communication channels (pipes) look like files. Some flavors of Unix support other kinds of objects that look like files, including network connections, ``named pipes'' and shared-memory regions. The Network File System (NFS) provides a convenient path for adding new kinds of ``file-like'' objects without modifying the operating system kernel. An NFS server running as user-level process can be ``mounted'' in the Unix name space. Any requests to open, read, or write files in this space are forwarded to the server. This trick is used in the CAPITL program-development environment and the SHORE object-oriented database system to allow access to database objects from ``legacy'' applications such as compilers, editors, grep , etc. without the need to modify, or even re-link them. I have written a package of C++ classes that encapsulate all the messy details of the NFS protocol to create a ``do it yourself'' NFS server kit. All you have to do is implement the necessary data structures to simulate Unix file behavior.

Use this kit to provide a Unix-compatible veneer over some other service. A representative example is FTP. Write a server that allows you to navigate the space of files accessible via anonymous FTP as if they were part of the local file system.

Shore:

Shore is an experimental object-oriented database being developed in our department. It combines many of the features of traditional databases (concurrency control and recovery and high-speed bulk operations), object-oriented databases (fine-grained strongly typed objects with identity), and file systems (a hierarchical name space with secure protection of objects and Unix compatibility). Write a persistent application using the facilities of Shore and critically evaluate how well it served your needs, or work to extend or improve Shore in some way (see me for ideas).

UW--Madison Research Projects:

Detailed descriptions of several of the research projects mentioned above (and more) are available via the CS Department Home Page . Most of the projects listed there would welcome participation by interested students.

Benchmarking Web Proxy Servers

Due to security and other concerns, many companies and institutions install Web proxy servers for their employees' web surfing. The proxy server usually runs on the fire-wall machine, and acts as the gateway between internal users and external web sites.

There are many commercial and free proxy server software available, including:

As the proxy server becomes more important, users need a way to choose from these proxy server software. However, though Web server benchmarking has received a lot of attention from industry, little has been done to benchmark proxy servers. Yet, since proxy servers are actually gateways, their usage patterns are likely to be very different from those of Web servers.

In this project, you are asked to design a benchmark from a set of proxy server activity logs. The benchmark maybe similar to Webstone, but should reflect the particular usage pattern on proxy servers. Then, you need to benchmark the freely available proxy servers, and those commercial ones that are free for a evaluation period (like Netscape's proxy server).

Optimized Cache Organization for Web Proxies

Many Web proxy servers come with caching options to reduce the traffic going out to the Internet. That is, once one user has seen a document, another user can get it from the proxy cache without going to the Internet. This has the advantage to reduce network bandwidth consumption.

However, at one large company with thousands of employees, the web proxy administrator told me that it is actually slower if they turn the caching option on in the proxy server. Apparently, the proxy server was saving cached documents as individual files in a three-level directory hierarchy, and the cost of pathname resolution and file retrieval in kernel outweighs the benefit of caching.

This raises an interesting research question. Traditional UNIX file systems are design for personal use of files; they are not designed for the situation when thousands of small files are stored on disk and retrieved constantly. Thus, proxy servers should probably avoid using file systems for Web page caching, but rather, store the files as non-uniform sized tuples in a relational database: the first column is URL, the second column is the document. Then, borrowing the techniques for fast indexing and efficient main-memory caching, one can hope to improve the caching performance of proxy servers significantly.

In this project, you are asked to build a simple prototype to test this idea quickly. You can start from an existing proxy server (like Apache or Squid), modify its web cache organization, and test the performance gain or losses.

Consistency protocols for Web caches

Since Web caching is becoming more wide-spread, the issue of cache consistency becomes an important issue. Current practices maintain only weak consistency among caches using protocols such as Time-To Live. A recent paper by Liu & Cao argues that maintaining strong consistency via invalidation is not much more expensive than maintaining weak consistency, and it is indeed feasible to maintain strong cache consistency for the whole web, despite its size.

The problem with the invalidation approach, however, is that sending many invalidation messages would take a long time. Thus, a hybrid protocol is proposed, where the web server would set an age field for a document, and guarantees to send out invalidation messages if the document is changed before its age expires.

In this project, you are asked to write a simulator for this consistency protocol that can take a web access trace as input and produce statistics about message sizes, number of invalidation messages, etc.

Cost-conscious network caching algorithms

Recently there have been many proposals on putting caches at network routers to reduce traffic. Proxy caching for the Web is one example. Caching at the routers provides an opportunity to adjust the load on network links through caches. Specifically, if the cache has the choice of caching page A or page B, and page A arrives at the cache through a slow/expensive link, while page B arrives at the cache through a fast/inexpensive link, then it is probably more desirable to cache A than to cache B.

It turns out that there is a theoretical online caching algorithm that takes the cost of getting a page into account, and the algorithm is quite simple (it is called Greedy-Dual). Recent studies show that for a single cache site, the algorithm can reduce the overall cost of missed pages much more than LRU. However, it is not clear what happens if more than one cache sites use the same algorithm, whether it will create a bottleneck on the fast link, etc. Thus, you are asked to write a simulator for the case when more than one cache sites use the algorithm, and test its performance, and maybe propose a new algorithm that adjusts the cost parameters as the network load varies.

Path profiling: a technique to catch back doors and other programmed threats

One of the major problems with operating system security is that an application bought or downloaded from other sources may do something quite unexpected at times. The something unexpected can be either a back door that bypasses authentication, or a logic bomb that destroys the system on a certain date, or a Trojan horse that appears doing one thing but actually does something else. Currently there is not much defense against these ``programmed threats'' except to buy things from trusted sources. But since all software vendors make no warranty about their software, it is little if any protection.

One effort to protect against these threats is to use path profiling. A binary-rewriting tools such as EEL (by Jim Larus) would instrument the binary and produce a trace of pages traversed during the application's execution. Then, to protect against back doors, we would ask the software vendor to give us the basic block codes of the authentication routines and those of the ``action'' routines (routines that change critical data), and then monitor the application in every execution to make sure that every time the ``action'' routines are reached, the path contains the basic block codes of the authentication routine.

Similarly, to protect against logic bombs, we would test-run the applications and record its typical path profiles and typical mixes of system calls. Then based on those ``fingerprints'' of typical behavior, we would catch and signal any abnormal actions of the application.

In this project, you are asked to start with a few UNIX utilities (such as passwd) and test whether this idea would work. Try get the applications' path profiles, then alter them to install a back door or logic bomb, and see whether the profiling would catch it.


solomon@cs.wisc.edu
Sat Feb 14 10:32:26 CST 1998