Project 1: Intro to Communication
Due: Mon, 9/25. Details below.
The project introduces you to the fundamentals of communication and RPC. It is to be done in groups of 2 or 3. The basic idea is to build a library that communicates reliably in the face of failure, and gain experience with an RPC package.
Part 0: Timing
Because the analysis of systems is part of everything you do in systems classes, the first thing you should learn is how to measure how long something takes. On Linux platforms, which we'll be using for this project, you can use clock_gettime() or gettimeofday().
An important aspect of a timer is its precision (or resolution): what is the smallest interval the timer can measure accurately? One way to determine the resolution is to read the clock value at the start and end of a simple loop. Start with a single loop iteration, then increase the iteration count until the difference between the before and after samples is greater than zero. Try to get the smallest non-zero positive difference. If a single iteration of the loop takes too much time, try putting simple statements between the two timer calls instead. Beware compiler optimization, which, if you are not careful, will remove the code in the loop and yield odd results. Record your result.
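The procedure above can be sketched in C as follows. This is a minimal sketch, not a required design: the function names (now_ns, time_loop_ns, smallest_nonzero_ns) are made up for illustration, and it assumes clock_gettime() with CLOCK_MONOTONIC. The loop counter is declared volatile so the compiler cannot delete the loop body.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() under strict -std modes */
#endif
#include <stdint.h>
#include <time.h>

/* Read CLOCK_MONOTONIC as a single nanosecond count. */
uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time `iters` iterations of a trivial loop. The volatile sink keeps
 * the optimizer from removing the loop. */
uint64_t time_loop_ns(long iters) {
    volatile long sink = 0;
    uint64_t start = now_ns();
    for (long i = 0; i < iters; i++)
        sink += i;
    return now_ns() - start;
}

/* For each trial, grow the iteration count from 1 until the measured
 * difference is non-zero, and keep the smallest such difference.
 * Returns UINT64_MAX if trials <= 0. */
uint64_t smallest_nonzero_ns(int trials) {
    uint64_t best = UINT64_MAX;
    for (int t = 0; t < trials; t++) {
        for (long iters = 1; ; iters++) {
            uint64_t d = time_loop_ns(iters);
            if (d > 0) {
                if (d < best)
                    best = d;
                break;
            }
        }
    }
    return best;
}
```

Calling smallest_nonzero_ns(1000) and printing the result gives one estimate of the resolution in nanoseconds. On many machines the first trial already returns a non-zero difference, in which case the number mostly reflects the cost of the two clock reads themselves.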
Once you feel confident you can time events with precision, you are ready to move on to Part 1.
Part 1: Using Your Timer
In this part, you will use your newfound ability to time things to verify Jeff Dean's "numbers you should know". See how many of these you can measure yourself, and make a table of your results.
Part 2: Reliable Communications
The first real part of the project is to build a reliable communication library on top of raw UDP-based sockets. You have a choice of languages to do this in: C, C++, or Go. If all of this sounds complicated, read this chapter to begin.
Your communication library should allow two processes to communicate via UDP packets, using a simple timeout-retry mechanism to detect when the receiver has not received a message and then re-send that message. The sender should keep trying to send the message until it gets an ack from the receiver. Your send code should be blocking, i.e., it should not return until an ack has been received.
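One possible shape for the blocking send is sketched below in C. Everything specific here is an assumption made for illustration: the name reliable_send, the 4-byte sequence-number header, the MAX_PAYLOAD limit, and the use of SO_RCVTIMEO for the timeout (select() or poll() would work as well). The receiver is assumed to reply with the same 4-byte sequence number as its ack.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical wire format: 4-byte sequence number (network order),
 * followed by the payload. */
#define HDR_LEN 4
#define MAX_PAYLOAD 1400

/* Send one message and block until the matching ack arrives.
 * Returns 0 on success, -1 on a bad argument or socket error. */
int reliable_send(int sock, const struct sockaddr *dst, socklen_t dst_len,
                  uint32_t seq, const void *data, size_t len, int timeout_ms)
{
    unsigned char pkt[HDR_LEN + MAX_PAYLOAD];
    if (len > MAX_PAYLOAD || timeout_ms <= 0)
        return -1;

    uint32_t net_seq = htonl(seq);
    memcpy(pkt, &net_seq, HDR_LEN);
    memcpy(pkt + HDR_LEN, data, len);

    /* Make recvfrom() give up after timeout_ms so we can retry. */
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    if (setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) < 0)
        return -1;

    for (;;) {
        if (sendto(sock, pkt, HDR_LEN + len, 0, dst, dst_len) < 0)
            return -1;

        uint32_t ack;
        ssize_t n = recvfrom(sock, &ack, sizeof ack, 0, NULL, NULL);
        if (n == (ssize_t)sizeof ack && ntohl(ack) == seq)
            return 0;                     /* acked: done */
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;                    /* real socket error */
        /* Timeout, interruption, or an ack for another message: resend. */
    }
}
```

Note that this sketch resends immediately when it sees a stale ack rather than waiting out the rest of the timeout; whether that is acceptable is a design decision you should make and measure.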
After you have this simple layer working, you will measure its performance and reliability characteristics. How much overhead is there to send a message (similar to the overhead discussed in the U-net paper)? What is the total round-trip time of sending a message and receiving an ack, both when running on a single machine and when running on two separate machines? What is the bandwidth of your library when sending a large number of max-sized packets between sender and receiver, again both on the same machine and on different machines? What limits your bandwidth, and how could you do better?
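When turning raw timings into the numbers asked for above, two small helpers may be handy. The sketch below is only a suggestion; the names throughput_mbps and median_ns are hypothetical, and reporting the median round-trip time (rather than the mean) is an assumption made here because a few slow samples can otherwise dominate.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Throughput in megabits per second for `bytes` moved in `elapsed_ns`.
 * Returns 0 if no time elapsed. */
double throughput_mbps(uint64_t bytes, uint64_t elapsed_ns) {
    if (elapsed_ns == 0)
        return 0.0;
    double seconds = (double)elapsed_ns / 1e9;
    return (double)bytes * 8.0 / seconds / 1e6;
}

/* qsort comparator for uint64_t values, ascending. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. For an even count,
 * averages the two middle samples. Returns 0 if n == 0. */
uint64_t median_ns(uint64_t *samples, size_t n) {
    if (n == 0)
        return 0;
    qsort(samples, n, sizeof samples[0], cmp_u64);
    if (n % 2 == 1)
        return samples[n / 2];
    return samples[n / 2 - 1] / 2 + samples[n / 2] / 2
           + (samples[n / 2 - 1] % 2 + samples[n / 2] % 2) / 2;
}
```

For example, 125,000,000 bytes transferred in one second works out to 125,000,000 × 8 / 1 / 10^6 = 1000 Mbit/s.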
As for reliability, you need to induce controlled message drops to show how your layer works. Arrange for your receiver library code to have an input that tells it what percent of messages it should drop (randomly). If this number is set to 10 percent, for example, your receive-side code should randomly drop 10 percent of messages; setting the number to 0 disables the induced drops. Send a stream of packets, with the drop percentage set to something non-zero, and measure the round-trip time of each reliable send; in a resulting graph, you should be able to see some high values where timeouts and retries occur. How many different performance regimes result?
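The receive-side drop decision can be as small as the C sketch below. The names should_drop and count_drops are hypothetical, and rand() is used only for simplicity; rand() % 100 has a slight modulo bias, which is assumed to be negligible for this experiment.

```c
#include <stdlib.h>

/* Decide whether to discard an incoming packet.
 * drop_percent is the configured drop rate, 0 to 100.
 * Returns 1 to drop, 0 to deliver. */
int should_drop(int drop_percent) {
    if (drop_percent <= 0)
        return 0;                 /* drops disabled */
    if (drop_percent >= 100)
        return 1;                 /* drop everything */
    return (rand() % 100) < drop_percent;
}

/* Helper for sanity-checking the drop rate: how many of `trials`
 * simulated packets would be dropped at the given percentage. */
int count_drops(int drop_percent, int trials) {
    int dropped = 0;
    for (int i = 0; i < trials; i++)
        dropped += should_drop(drop_percent);
    return dropped;
}
```

In the receive path, a packet for which should_drop() returns 1 would simply be discarded before any ack is sent, so the sender sees it as lost and retries after its timeout. Call srand() once at startup if you want a different drop pattern on each run.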
When running your experiments, compile your library and test code both with and without optimization enabled (i.e., -O). How much difference does this make in your performance results?
Part 3: Google RPC and Apache Thrift
Next, you'll measure the overhead of marshalling a message (e.g., packing an item into a protobuf). How long does it take to pack an int, a double, a string (of varying size), a complex structure on each platform? How confident are you about your measurements?
Now measure round-trip time for a small message, both when client and server are on the same machine and when they are on different machines. How long do requests/responses take? Is the first round trip much slower than subsequent ones? How much overhead does using protobufs and RPC add, as compared to your barebones communication library from Part 2?
Now measure bandwidth when sending large amounts of data. How quickly can gRPC send large amounts of data from one machine to another? What total bandwidth is achievable when using server streaming or client streaming? What about Thrift? How large do messages have to be to reach peak line rate (as measured in the U-net paper)?
When running your experiments, compile gRPC and your test code both with and without optimization enabled (i.e., -O). How much difference does this make in your performance results?
Machines To Use
For this project, you just need to find two Linux machines that can speak to one another on a network. Any of the CSL lab machines should work fine. You can remotely log into one or two (as needs dictate) and do your work from there. If you have access to other machines, that is also fine; for example, most of the cloud service providers give out free credits for educational use. Or, pay for them! They are cheap.
Warning: if using CSL machines, please clean up after you are done. Specifically, don't leave server processes running; make sure to run ps auxw (and grep) to find what processes you have left running and then kill or killall to halt them.
Handing It In
To turn this project in, you'll just meet with me and bring some graphs describing what you have measured. You'll then explain what you did via graphs, showing me measurements and results as outlined above. We'll have a signup sheet as the date of the class approaches, and I'll also give a little more detail in class.