CS-838: Mini-project #1

Important Dates

Due: 03/03

Overview

In this assignment, you will get your feet just a little bit wet with a computer system. One of the most important parts of computer systems is evaluation. Hence, understanding what it takes to perform such an evaluation is a skill we wish to develop.

What you're going to do:

First, you have to pick a platform to study. Any Unix-based system (such as a PC running some version of Linux) is acceptable, and any virtualization platform is also fine (e.g., Xen, KVM). For this assignment, please no Windows-based systems, unless you know what you are doing. Second, you're going to run some simple experiments, which you design to bring out various properties of the file system under test, in both virtualized and non-virtualized settings. Third, you are going to create some graphs to demonstrate those properties -- call them empirical proofs. Finally, you will write up what you did.

You can work in groups of two for this project.

More Detail

In this assignment, we're going to explore the inner workings of the file system in both virtualized and non-virtualized settings. In a Unix-based file system, assume we have the following system calls to work with: open(), close(), read(), write(), lseek(), fsync().

Our main approach is going to be to write little code snippets that exercise the file system in different ways; then, by measuring how long various operations take, we are going to try to make some deductions about what the file system is doing. We'll then do the same when running on a virtual machine monitor of some kind, and see what differences arise. Comparison of the two results should lead to some insights as to how virtualization works, and its basic costs.

Step 0: Platform

Pick a platform you will work upon. Very likely it will be something like a PC running Linux, but please feel free to be adventurous -- it will keep my eyes open when reading your project on a system that is a bit different, e.g., FreeBSD, some ugly old Unix system like AIX, or even Mac OS X. However, please do use a Unix-based system.

Do a little research on the file system. Some systems, such as MacOS HFS and Linux XFS, use extents in the inode instead of blocks, making it almost impossible to measure the block size. The file system layout may determine some of your experiments.

Do some research on the virtualization platform as well. How does it virtualize I/O, for example?

To measure information about the cache, you will need to be able to control what is in the cache and what is not. There are several ways of accomplishing this, both with and without root privilege.
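One unprivileged way to control the cache can be sketched as follows, assuming a Linux system: after flushing dirty pages with fsync(), a posix_fadvise() call with POSIX_FADV_DONTNEED asks the kernel to drop a file's cached pages. The call is only advisory, and the helper name evict_from_cache is made up for illustration. With root privilege on Linux, writing to /proc/sys/vm/drop_caches is another option. Other systems may need a different approach.

```python
import os

def evict_from_cache(path):
    """Ask the kernel to drop the cached pages of `path` (advisory; Linux assumed)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages cannot be dropped, so flush them to disk first.
        os.fsync(fd)
        # Offset 0 with length 0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Because the request is only a hint, it is worth confirming with a timing test that a read after eviction is in fact slower than a read that hits the cache.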

Step 1: Timers

The accuracy and granularity of the timer you use will often have a large effect on your measurements. Therefore, you should use the best timer available. Fortunately, on x86 platforms, a highly accurate cycle counter is available. The instruction to use it is known as rdtsc and it returns a 64-bit cycle count. By knowing the cycle time, one can easily convert the result of rdtsc into a useful time. Some potential pitfalls:

If the processor can automatically vary the clock speed, the timestamp counter will not reflect real time.
On a multicore system, different processor cores may have different values for the timestamp; you can only compare values on a single core and not across cores.

Hence, the first thing you should do is: figure out how to use rdtsc or its analogue (you can use Google to find out more about it). Once you know how to call it and get a cycle count, convert the result to seconds and measure how long something takes (e.g., a program that calls sleep(10) and exits should run for about 10 seconds). Confirm your results make sense by comparing them to a less accurate but reliable counter such as gettimeofday(). Note that confirmation of timer accuracy is hugely important! If you don't trust your timer, how can you trust the results of your measurements?
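The validation idea can be sketched in Python. This is only an illustration under assumptions: time.perf_counter_ns() stands in for the cycle counter (Python has no direct rdtsc call), time.time() plays the role of gettimeofday(), and the function names are hypothetical. The conversion helper shows the arithmetic you would apply to a real rdtsc delta once you know the clock rate.

```python
import time

def cycles_to_seconds(cycles, cycles_per_second):
    """Convert a cycle count (e.g., an rdtsc delta) to seconds."""
    return cycles / cycles_per_second

def check_timer(sleep_seconds):
    """Time one sleep with a fine-grained and a coarse clock.

    Returns (fine, coarse), both in seconds. The fine-grained interval
    encloses the coarse one, so it should be slightly longer.
    """
    fine_start = time.perf_counter_ns()
    coarse_start = time.time()
    time.sleep(sleep_seconds)
    coarse = time.time() - coarse_start
    fine = (time.perf_counter_ns() - fine_start) / 1e9
    return fine, coarse
```

Calling check_timer(10) should return two values close to 10 seconds. A large gap between them suggests that one of the clocks, or the assumed clock rate, cannot be trusted.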

Step 2: Measuring the File System

After getting our timer in order, we will move on and measure some aspects of the file system proper. All measurements should be done on the local disk of some machine -- do not measure the performance of a distributed file system such as AFS, where, for example, your CS account resides.

Furthermore, all measurements should be done on both the non-virtualized OS as well as a guest OS running on some kind of virtualized host. Comparison of the two results is the main purpose of the experiment; what happens to your answers as virtualization is introduced?

Through experiments that you design, implement, run, and measure, you are to answer the following questions:

What is the ideal buffer size for random file access? Hint: use reads of varying sizes and plot the time it takes to do such reads. Also, be wary of prefetching effects that often kick in during sequential reads.

During a sequential read of a large file, how much data is prefetched by the file system? Hint: time each read and plot the time per read.

How big is the file cache? Hint: Repeated reads to a group of blocks that fit in cache will be very fast; repeated reads to a group of blocks that don't fit in cache will be slow. Think about how big the cache is likely to be compared to the amount of physical memory in the system.

For what file sizes does the file system add an additional layer of indirection? Usually, an inode can only hold pointers to a few blocks, and after that additional blocks must be read off disk containing more pointers. Hint: think about using write() and fsync() to answer this question. Also think about what happens when you extend a file and suddenly an indirect pointer must be allocated -- how many more writes occur at that point?
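The hints above all reduce to timing individual system calls and recording one number per call. A minimal sketch of such a harness follows. The function names are hypothetical, the clock is Python's time.perf_counter() rather than a validated cycle counter, and interpreter overhead adds noise that a C version would avoid, so treat it as a starting point rather than a finished experiment.

```python
import os
import random
import time

def timed(op):
    """Run op() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = op()
    return result, time.perf_counter() - start

def sequential_read_times(path, read_size):
    """Read the file front to back; return the latency of each read()."""
    fd = os.open(path, os.O_RDONLY)
    times = []
    try:
        while True:
            data, elapsed = timed(lambda: os.read(fd, read_size))
            if not data:  # empty result means end of file
                break
            times.append(elapsed)
    finally:
        os.close(fd)
    return times

def random_read_times(path, read_size, count, seed=0):
    """Seek to random (unaligned) offsets and time one read() at each."""
    rng = random.Random(seed)
    last_start = max(os.path.getsize(path) - read_size, 0)
    fd = os.open(path, os.O_RDONLY)
    times = []
    try:
        for _ in range(count):
            os.lseek(fd, rng.randint(0, last_start), os.SEEK_SET)
            _, elapsed = timed(lambda: os.read(fd, read_size))
            times.append(elapsed)
    finally:
        os.close(fd)
    return times

def append_fsync_times(path, block_size, blocks):
    """Grow a file one block at a time; time each write() + fsync() pair."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    payload = b"\0" * block_size
    times = []

    def step():
        os.write(fd, payload)
        os.fsync(fd)

    try:
        for _ in range(blocks):
            _, elapsed = timed(step)
            times.append(elapsed)
    finally:
        os.close(fd)
    return times
```

Each function returns a list of per-operation latencies, which can be plotted directly against the operation index or file offset. The measurements are only meaningful if you control the cache state before each run.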

Hence, in your write-up, you should have one or more graphs which you use to directly answer the questions above.

Be critical of yourself -- are the conclusions you draw foolproof? Or are they mere hypotheses?

A major issue with any data collection is: how convincing are your numbers? How do you make them more convincing? How do you deal with experimental noise? etc. Use your common sense and be critical of your numbers -- do they really convince you that you know the answer?

One thing you will undoubtedly do is to use repetition to increase your confidence, i.e., you will take multiple measurements of an event, and compute (for example) an average over many runs instead of the result from just a single experiment. Be careful when computing averages over numbers -- make sure to always first look at all the data.

If you don't, you might use an average where an average doesn't make sense.
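A small example of why looking at all the data matters, using made-up latencies (nine fast cache hits and one slow disk access). The helper name summarize is hypothetical.

```python
import statistics

def summarize(samples):
    """Return min, median, mean, and max so an outlier cannot hide in the mean."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "max": max(samples),
    }

# Hypothetical latencies in milliseconds: nine cache hits and one disk access.
samples = [0.1] * 9 + [10.0]
print(summarize(samples))
```

Here the mean is roughly 1.09 ms, a latency that no single operation actually experienced, while the median is 0.1 ms. Neither number alone tells the story; the raw samples show two distinct populations.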

Step 3: Writing It Up

After you're done with experiments, you'll need to write up what you've done. What should go in your writeup? Here are some tips:

Title: The title should be descriptive and fit in one line across the page.

Author: Right under the title, this says who you are.

Abstract: This is the paper in brief and should state the basic contents and conclusions of the paper. The abstract is not the introduction to the paper (it should be shorter), but is a summary of everything. Read some of the abstracts of papers we've read for class to get a better idea. In general, the abstract is an advertisement that should draw the reader into your paper, without being misleading. It should be complete enough to understand what will be covered in the paper. This is a technical paper and not a mystery novel -- don't be afraid of giving away the ending!

Intro: A short overview of what you did, and what you learned. More motivation than the abstract, and more details. Again, make sure you include your main conclusions.

Methodology: How you measured what you measured. Include something about your timer accuracy here, as well as a description of the platform you are using to the level of detail such that someone else could reproduce the experiment elsewhere.

Results: This section should consist mainly of graphs, addressing each of the questions above. Make sure that graphs have axes labeled (including units). Also make sure to include the code snippets with each graph (or some rough description of them) so we have an idea what exactly you measured. Also, make sure to draw appropriate conclusions about each graph.

Conclusions: Summarize your conclusions here, and talk about what else you have learned in the process.

This paper should be at most 6 pages long (including everything), in 10 point or larger font, in double column format. In your write-up, you should not re-describe the assignment. Your paper must be written using proper English grammar and should have no spelling mistakes.

The paper will be graded as follows:

Presentation: 1/3 How well written and structured is the paper? Are the figures and tables legible?
Methodology: 1/3 Is the methodology sound? Will it accurately measure and return the correct results? Does the reader have confidence in your results?
Explanation: 1/3 Do you explain your results completely? Are all features of your results graph explained?

Here are hints on the writeup:

When proposing experiments, describe your hypothesis: how the system works and how you will expose that behavior. Make a prediction on what should happen if your hypothesis is correct.
Don't use passive voice. For example, instead of "the inode is read from disk," say "the FS reads the inode from disk."
When describing experiments, discuss the relevant details of the platform, such as processor type and speed, memory size, OS name and kernel version.
When graphing, change the scale to highlight the useful stuff. If nothing happens at the top, chop it off. Log scale may be useful, but it tends to minimize large percentage differences.
Perform multiple runs to make sure the data is good. Depending on the test, you might want best case, average case, median case, or worst case. Understand why. When looking for phenomena, best case may work well. If using randomness, you may want the median or average case.
If you have data anomalies, explain why they occur. For example, if there is a sudden spike or downturn in a graph, explain why.
Read your paper through completely before handing it in. Spell check it as well (aspell and ispell are good programs to use with LaTeX).
When including code fragments, only include the relevant code. For example, you can leave out most arithmetic or bookkeeping code that is not relevant to the computation.

We also recommend/require the following two tools: LaTeX for typesetting, and ploticus (or zplot?) for graphs.

Step 4: Turn it in

Please email me your paper by midnight on the project due date.

Step 5: Enjoying Yourself

Computer systems are complicated, and careful and accurate measurement is a tricky business. Make sure to have fun! How should you do that? Probably by starting early.

As always, feel free to ask questions or stop by office hours if you are having trouble. Good luck!