UW-Madison
Computer Sciences Dept.

CS739 Spring 2005: Projects

Autonomic Systems

A major research area is developing complex systems that are easier to manage. We need innovations in a number of different areas for this to happen.

Deriving Causal Paths

For either a human or the system itself to "understand" behavior, it's useful to know the relationship between different requests or messages in the system. We've read a number of papers that describe techniques for deriving the causal paths in a distributed system. In this project, you will develop extensions to those techniques. There are a number of options here.

  • Apply the technique of delaying packets (used previously to analyze the Centera) to the distributed system of your choice.
  • Adapt the technique of delaying packets so that it will work on-line. If there are multiple concurrent requests, will delayed messages still help you infer relationships? How can you make the overhead acceptable in an on-line system? For example, can you find an acceptably short delay period or delay only a small sample of messages?
  • What additional information is available to help infer causality? Does examining the content of messages help one determine that a message is dependent on a previous one? For example, if one sees that a subset of the data is copied from one message to the next, one can infer that they are likely to be dependent. If this information is useful, how can this searching be done efficiently?
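
As a rough illustration of the content-based option above, the sketch below treats two messages as likely dependent when enough fixed-size byte windows from the earlier message reappear in the later one. The function names, the window size k, and the threshold are all assumptions chosen for illustration; hashing the windows into a set is just one way to keep the search close to linear in the message lengths.

```python
def shared_windows(earlier: bytes, later: bytes, k: int = 8) -> int:
    """Count distinct k-byte windows of `earlier` that also occur in `later`."""
    if len(earlier) < k or len(later) < k:
        return 0
    # Building both window sets is linear in the message sizes.
    earlier_set = {earlier[i:i + k] for i in range(len(earlier) - k + 1)}
    later_set = {later[i:i + k] for i in range(len(later) - k + 1)}
    return len(earlier_set & later_set)


def likely_dependent(earlier: bytes, later: bytes,
                     k: int = 8, threshold: int = 4) -> bool:
    """Heuristic only: copied content suggests, but does not prove, causality."""
    return shared_windows(earlier, later, k) >= threshold


if __name__ == "__main__":
    # Hypothetical messages: a client request and a back-end request it triggered.
    request = b"GET /files/report-2005.pdf HTTP/1.0"
    forwarded = b"READ report-2005.pdf offset=0"
    unrelated = b"PING node-17 seq=42"
    print(likely_dependent(request, forwarded))
    print(likely_dependent(request, unrelated))
```

A real system would also have to bound how far back in time it searches, since comparing every message against every earlier message grows quadratically.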

Monitoring and Changing Behavior

Proxies are useful in a distributed system for both monitoring and changing the behavior of the system. Proxies are often used for filtering or load-balancing; what else can they be used for?

  • To determine whether or not a primary system is behaving correctly, a proxy could replicate requests to a backup/secondary system as well. The proxy can then compare the outputs of the two to determine if there is a problem. However, when performing this comparison, one must be able to filter out unimportant differences, such as timing. In this project, you will build a comparator for the distributed system of your choice.
  • A proxy can be used to convert one protocol to another. For example, NFS is the lingua franca of distributed file systems -- virtually every known operating system can mount NFS volumes. However, in certain departments and work environments, other file systems are used (e.g., here we use AFS). In this project, you seek to provide access to these other file systems without requiring new client-side software. Instead, you will build an NFS-to-AFS "bridge" that transforms NFS requests into their meaningful AFS counterparts. Thus, you will have a machine that sits there, takes NFS requests, and passes them on to AFS servers. Many issues arise: performance, consistency, and security in particular come to mind. Alternatively, one could convert NFS to HTTP, or vice versa.
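
For the comparator idea in the first bullet above, one possible starting point is to normalize both responses before comparing them, dropping fields that are expected to differ between replicas. The sketch below assumes text responses with HTTP-style header lines; the header names in VOLATILE_PREFIXES and the function names are hypothetical, and a real comparator would need system-specific rules.

```python
# Hypothetical header prefixes that are expected to differ between replicas.
VOLATILE_PREFIXES = ("date:", "x-request-id:")


def normalize(response: str) -> str:
    """Drop volatile lines and trailing whitespace so only meaningful text remains.

    Simplification: any line starting with a volatile prefix is dropped,
    even if it happens to appear in the body rather than the headers.
    """
    kept = []
    for line in response.splitlines():
        line = line.rstrip()
        if line.lower().startswith(VOLATILE_PREFIXES):
            continue
        kept.append(line)
    return "\n".join(kept)


def responses_match(primary: str, secondary: str) -> bool:
    """True when the two responses agree after unimportant differences are removed."""
    return normalize(primary) == normalize(secondary)


if __name__ == "__main__":
    primary = "HTTP/1.0 200 OK\r\nDate: Mon, 07 Feb 2005 10:00:00 GMT\r\n\r\nhello"
    secondary = "HTTP/1.0 200 OK\r\nDate: Mon, 07 Feb 2005 10:00:01 GMT\r\n\r\nhello"
    print(responses_match(primary, secondary))
```

Deciding which differences are "unimportant" is the hard part of the project; the prefix list here only stands in for that decision.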

Fault-Tolerance

Understanding Failures

One of the keys to building a distributed system is having it operate correctly when a node fails. However, less attention is generally paid to handling subsystem failures: for example, when a single process dies, or when a disk block or memory chip either fails or returns incorrect data. For example, what happens to Linux (or perhaps an NFS server running on Linux) when you corrupt a data structure within it? In this project, you will characterize how the system of your choice handles a range of these more interesting failures.
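
One simple way to generate such failures is to corrupt data in a controlled, repeatable way and then observe how the system reacts. The sketch below flips a single randomly chosen bit in a block of bytes; the function name is hypothetical, and how the corrupted block is fed back into the system under test (a disk image, a memory region, a network message) is left open.

```python
import random


def flip_random_bit(block: bytes, rng: random.Random) -> bytes:
    """Return a copy of `block` with exactly one randomly chosen bit inverted."""
    if not block:
        return block
    corrupted = bytearray(block)
    index = rng.randrange(len(corrupted))  # which byte to damage
    bit = rng.randrange(8)                 # which bit within that byte
    corrupted[index] ^= 1 << bit
    return bytes(corrupted)


if __name__ == "__main__":
    # A seeded generator makes each injected fault reproducible.
    rng = random.Random(739)
    original = bytes(16)
    damaged = flip_random_bit(original, rng)
    print(original.hex())
    print(damaged.hex())
```

Recording the seed alongside the observed behavior makes it possible to replay the same fault later.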

Improving Fault-Tolerance

Microreboots and failure-oblivious computing are two new techniques that have been proposed for improving the reliability of servers. How can these ideas be extended?

  • Can you apply either of these techniques to a modern OS? For example, can you build a rebootable file system within Linux? Or could you alter Linux so as to use the failure-oblivious computing infrastructure?
  • Is it possible to implement a hybrid of these two techniques? For example, while a component is being rebooted (or microrebooted), can the system return manufactured values?
  • Where else is failure-obliviousness useful? For example, the authors apply this idea to memory; is there an analog for disks? (e.g., when the disk fails or the user tries to read past the end of a file, can you just "manufacture" a result and continue computing?)
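
To make the read-past-end-of-file example above concrete, the sketch below wraps a read so that any bytes that cannot be obtained are replaced with manufactured filler instead of surfacing an error or a short read. The function name and the choice of zero bytes as filler are assumptions for illustration; what values to manufacture is itself part of the research question.

```python
import io


def oblivious_read(f, offset: int, size: int, filler: bytes = b"\x00") -> bytes:
    """Read `size` bytes at `offset`, manufacturing filler for bytes that cannot be read.

    `filler` is assumed to be a single byte.
    """
    try:
        f.seek(offset)
        data = f.read(size)
    except OSError:
        # Treat an I/O failure like a completely missing region.
        data = b""
    missing = size - len(data)
    return data + filler * missing


if __name__ == "__main__":
    f = io.BytesIO(b"abcdef")
    print(oblivious_read(f, 4, 4))    # partly real, partly manufactured
    print(oblivious_read(f, 100, 3))  # entirely manufactured
```

The caller always receives exactly the number of bytes it asked for, so it can continue computing; whether continuing is safe depends on the application.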

New Applications

The peer-to-peer community seems to still be searching for its killer application. Can you implement one? You may want to build your application on top of one of the existing DHT implementations (e.g., Bamboo or OpenDHT). Some possibilities for applications include:

  • Personal Communication References: Tired of reading papers that refer to personal communication? This service would allow researchers to enter an extended quote from someone that they wish to cite; others would then be able to verify the exact original quote.
  • Mail System: Spam is a big problem. We can pretend to fix it by trying fancy new filters or other band-aids, but the real problem lies in the basic construction of the system. So let's junk the old system and make a new one! In this project, you will do just that: build a P2P system that avoids spam as a first principle. One thing you could do is to include a strong concept of identity in it -- in other words, to join this mail system, someone has to let you join (say, a friend). Then, if you start sending mails people don't like, they can trace it to you (and also to your friend), and kick one or both of you off of the system. There are other ways to approach the problem too; for example, by requiring the sender to perform some computation before it can send something to a receiver. In any case, there is a huge space of problems that could be attacked here. You may want to start with the ePOST code base.
  • Video Indexing: There is so much video out there on the web, and it is growing. In this project, you will build a P2P overlay that indexes all the video out there based on its audio stream. A couple of approaches are possible: use speech-to-text technology, or grab the closed-captioned text and use that.
  • Overcite: A repository for academic papers, similar to CiteSeer. This has been suggested by other researchers, but I don't believe it is available yet.
  • Fundamental Algorithms: There is a wealth of distributed algorithms in the literature, but their behavior under realistic assumptions is often not well understood. In this project, you will implement any traditional distributed algorithm (e.g., Paxos) and evaluate how it behaves in the "real world". How scalable is it? How does it perform under failure? How does it react to network delays?
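
For the mail-system idea above, requiring computation from the sender could look something like the following proof-of-work sketch: the sender must find a nonce whose hash, together with the sender and recipient addresses, has a given number of leading zero bits, while the receiver verifies it with a single hash. The function names, the payload format, and the difficulty values are assumptions for illustration, not part of any existing mail system.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits before the first one bit in `digest`."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify_stamp(sender: str, recipient: str, nonce: int, difficulty: int = 12) -> bool:
    """Cheap check performed by the receiver: one hash computation."""
    payload = f"{sender}|{recipient}|{nonce}".encode()
    return leading_zero_bits(hashlib.sha256(payload).digest()) >= difficulty


def mint_stamp(sender: str, recipient: str, difficulty: int = 12) -> int:
    """Expensive search performed by the sender: about 2**difficulty hashes on average."""
    for nonce in count():
        if verify_stamp(sender, recipient, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    nonce = mint_stamp("alice@example.org", "bob@example.org")
    print(nonce, verify_stamp("alice@example.org", "bob@example.org", nonce))
```

Because the stamp binds the recipient address, a sender cannot reuse one stamp across many recipients; choosing a difficulty that burdens bulk senders without hurting ordinary users is an open design question.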

 