The goal for this project is to implement a reliable distributed system.
You should do this project in groups of 3 or 4.
The service to provide is a simple key-value store, where keys and values are strings. Your service should be as consistent as possible, so that requesting a value returns the most recently set value as often as possible. Furthermore, your service should tolerate failures of a process, node, or network.
The service should run on between 1 and 4 nodes, and should continue to provide service when only some of the nodes are available. You can assume the complete set of nodes is provided when your service starts. A client should be able to request service from any machine running your service.
You can assume that at least one node will always be running, so you do not need to store data persistently (unless you want to). You can also assume that the total size of all the data will be small (less than 10 megabytes).
The goal of this work is to get experience implementing consistent, fault-tolerant services. Hence, the primary criterion for your work is whether your service can continue to provide consistent results in the presence of failures.
To implement this, you will need to implement some form of replication, which ensures that data is available even when one of the machines goes down.
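As one illustration of replication, here is a minimal sketch of majority-quorum reads and writes. It is not a required design: the names `Replica` and `QuorumStore` are hypothetical, plain method calls stand in for network messages between servers, and a replica's `up` flag stands in for a crashed or unreachable node. Note that a majority quorum refuses requests when fewer than half of the nodes are reachable, which favors consistency over availability; your group may choose a different trade-off.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Replica:
    """In-process stand-in for one server node (hypothetical)."""
    name: str
    up: bool = True
    # key -> (version, value); higher version means more recent write
    data: Dict[str, Tuple[int, str]] = field(default_factory=dict)

    def store(self, key: str, version: int, value: str) -> bool:
        """Accept a write if reachable; keep only the newest version."""
        if not self.up:
            return False
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)
        return True

    def fetch(self, key: str) -> Optional[Tuple[int, str]]:
        """Return (version, value), (0, "") for an unknown key, None if down."""
        if not self.up:
            return None
        return self.data.get(key, (0, ""))


class QuorumStore:
    """Coordinator that needs a majority of replicas for every operation."""

    def __init__(self, replicas: List[Replica]):
        self.replicas = replicas
        self.quorum = len(replicas) // 2 + 1

    def _read_quorum(self, key: str) -> List[Tuple[int, str]]:
        replies = [r.fetch(key) for r in self.replicas]
        replies = [x for x in replies if x is not None]
        if len(replies) < self.quorum:
            raise RuntimeError("fewer than a majority of replicas reachable")
        return replies

    def get(self, key: str) -> Optional[str]:
        # Any two majorities overlap, so the newest version is among the replies.
        version, value = max(self._read_quorum(key), key=lambda t: t[0])
        return value if version > 0 else None

    def put(self, key: str, value: str) -> None:
        # Pick a version newer than anything a majority has seen.
        version = max(v for v, _ in self._read_quorum(key)) + 1
        acks = sum(1 for r in self.replicas if r.store(key, version, value))
        if acks < self.quorum:
            raise RuntimeError("write not acknowledged by a majority")
```

With three replicas, a write that succeeds while one replica is down is still visible to a later read that reaches a different pair of replicas, because the two majorities share at least one node. This sketch does not handle two coordinators writing concurrently, which could pick the same version number.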
You may also want to implement some kind of failure detector, so your service knows when its peers are unavailable. Here is a bibliography of failure detectors.
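A simple failure detector can be built from heartbeats and a timeout. The sketch below is one possible approach, not a prescribed one: the class name `HeartbeatDetector`, the 2-second default timeout, and the injectable `clock` parameter are all assumptions made for illustration. How heartbeats actually travel between servers (sockets, RPC, HTTP) is left out.

```python
import time
from typing import Callable, Dict, Iterable, List


class HeartbeatDetector:
    """Suspect a peer if no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, peers: Iterable[str], timeout: float = 2.0,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout
        self.clock = clock
        now = clock()
        # Treat every peer as alive at startup.
        self.last_seen: Dict[str, float] = {p: now for p in peers}

    def heartbeat(self, peer: str) -> None:
        """Record that a heartbeat from `peer` just arrived."""
        self.last_seen[peer] = self.clock()

    def suspected(self) -> List[str]:
        """Return the peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(p for p, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

A timeout-based detector can only suspect a failure: a slow or partitioned peer looks the same as a crashed one, so your service should behave sensibly when a suspected peer turns out to be alive.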
You may find it helpful to read about Amazon's Dynamo system.
The four servers all run in distinct virtual machines. The client runs on a standard CSL workstation. The client and the four VMs are the only machines in the system.
The specification of the service is on the Project 1 Specification page. This will initially be a proposed specification, in that the class may edit it to converge on a better one.
You can assume that:
Your service should tolerate:
In cases where the servers do not fail, your service should provide strong consistency even if the client is unable to communicate with all the servers.
You may use any implementation language for the server and client, as long as it implements the required command line interface and protocol. For example, you can use regular sockets, RPC, or Google's protocol buffers for communication between your servers, or use the existing HTTP implementation. We will provide a simple implementation of a web server from which you may start, if you wish.
You should write a script to start/configure your server.
You should test your code with VirtualBox. To simulate multiple physical machines, you can run multiple virtual machines simultaneously.
There are pre-configured 32-bit Ubuntu 11 images available in AFS: /p/course/cs739-swift/public/projects/p1/images. The source iso is in the adjacent 'iso' directory. Instructions on how to use the image are in a readme.
Here are some standard distributed system techniques you may want to look at:
In addition to developing the service, you will also develop tests that should work with any service that implements the specification. Each group will be asked to test the services of a few other groups.
Your tests should be able to verify the consistency/fault tolerance properties of a service implementation. For example, if you test a project that claims to provide perfect consistency and availability in the presence of partition failures, you could try concurrently writing to both sides of the partition and look for inconsistencies.
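The checking half of such a partition test might look like the following sketch. It assumes you already have some way to read a key from each server; here that is modeled as a plain function per server, and the name `find_divergence` is hypothetical. Creating the partition and issuing the concurrent writes depend on the project specification and are not shown.

```python
from typing import Callable, Dict, List, Optional

# A reader takes a key and returns the value one server reports (or None).
Reader = Callable[[str], Optional[str]]


def find_divergence(readers: Dict[str, Reader], keys: List[str]) -> List[str]:
    """Return the keys for which the servers do not all report the same value."""
    diverged = []
    for key in keys:
        answers = {name: read(key) for name, read in readers.items()}
        if len(set(answers.values())) > 1:
            diverged.append(key)
    return diverged


# Usage with two dictionaries standing in for the two sides of a partition.
left = {"x": "written-on-left"}
right = {"x": "written-on-right"}
conflicts = find_divergence({"left": left.get, "right": right.get}, ["x"])
```

In a real test, each reader would send a request to one server after the partition heals, and any key returned by `find_divergence` would be reported as an inconsistency.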
These are the tests that were used last year:
What to turn in
On the first turn-in day, turn in your service for other people to test
This should include:
You should email the instructor a location where other groups can download your service in a virtual machine.
On the second turn-in day
It may be that your tests don't work on other services, or that other services don't work with your tests. You should cooperate with the other groups to fix both your services and tests to be as useful as possible. When you turn in the results, please include a short description of the changes you had to make to your tests and the changes other groups had to make to their services.
Performance is not the primary concern of this project. We are instead interested in the ability to return consistent results under failure conditions.
In addition, we will consider (as secondary considerations):
For the writeup, we will look for these things: