2/13:
If you want to clear the file system cache, an easy way to do so is to first unmount the file system and then to remount it (umount/mount). Of course, you have to be super-user to do this.
2/13:
Some of the tests (particularly anything that writes to disk) will be easiest on top of an older, simpler file system such as ext2. Modern file systems such as ext3, ReiserFS, etc., are journaling file systems, and hence introduce additional write traffic into the picture (we'll hear more about journaling later in the semester). That said, if you can still manage on top of these more complex systems, all the better.
2/13:
Note that more recent versions of Linux do not use an entirely LRU-like replacement policy.
What you're going to do:
First, you have to pick a platform to study. Any Unix-based system (such as a PC running some version of Linux) is acceptable. For this assignment, please do not use a Windows-based system unless you also study a Unix-based system in addition. Second, you're going to run some simple experiments, which you design to bring out various properties of the file system under test. Third, you are going to create some graphs to demonstrate those properties -- call them "empirical proofs". Finally, you will write up what you did.
You are to work alone on this project.
Talking to your friends at some level is OK, but this should primarily be an exercise for you. Part of the process is to develop some measurement skills now for later use in your final project, so do a good job here and it will pay off down the road.
Our main approach is going to be to write little code snippets that exercise the file system in different ways; then, by measuring how long various operations take, we are going to try to make some deductions about what the file system is doing.
Hence, the first thing you should do is figure out how to use rdtsc or its analogue (you can use Google to find out more about it). Once you know how to call it and get a cycle count, convert the result to seconds and measure how long something takes (e.g., a program that calls sleep(10) and exits should run for about 10 seconds). Confirm that your results make sense by comparing them to a less accurate but reliable counter such as gettimeofday. Note that confirmation of timer accuracy is hugely important! If you don't trust your timer, how can you trust the results of your measurements?
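As a rough illustration of this kind of timer check, here is a minimal Python sketch. It is not rdtsc itself: Python cannot read the cycle counter directly, so time.perf_counter_ns() stands in as the fine-grained timer and time.time() stands in for gettimeofday. The function names cycles_to_seconds and check_timer are made up for this example, and the same structure would apply to a real cycle counter once you know its frequency.

```python
import time


def cycles_to_seconds(ticks, ticks_per_second):
    """Convert a raw tick (or cycle) count to seconds, given the counter frequency."""
    return ticks / ticks_per_second


def check_timer(sleep_seconds):
    """Time one sleep with a fine-grained and a coarse clock.

    Returns (fine, coarse), both in seconds, so the two can be compared.
    """
    coarse_start = time.time()            # wall clock, stand-in for gettimeofday
    fine_start = time.perf_counter_ns()   # fine-grained counter, in nanoseconds
    time.sleep(sleep_seconds)
    fine_end = time.perf_counter_ns()
    coarse_end = time.time()

    # perf_counter_ns ticks once per nanosecond, i.e. 1e9 ticks per second.
    fine = cycles_to_seconds(fine_end - fine_start, 1_000_000_000)
    coarse = coarse_end - coarse_start
    return fine, coarse
```

Calling check_timer(10) should return two values that are both close to 10 and close to each other; a large gap between them suggests the tick-to-seconds conversion (or the timer itself) should not yet be trusted.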
Through experiments that you design, implement, run, and measure, you are to answer the following questions:
How big is the block size used by the file system to read data? Hint: use reads of varying sizes and plot the time it takes to do such reads. Also, be wary of prefetching effects that often kick in during sequential reads.
During a sequential read of a large file, how much data is prefetched by the file system? Hint: time each read and plot the time per read.
How big is the file cache? Hint: Repeated reads to a group of blocks that fit in cache will be very fast; repeated reads to a group of blocks that don't fit in cache will be slow.
How many direct pointers are in the inode? Hint: think about using write() and fsync() to answer this question. Also think about what happens when you extend a file and suddenly an indirect pointer must be allocated -- how many more writes occur at that point?
Hence, in your write-up, you should have one or more graphs which you use to directly answer the questions above. Be critical of yourself -- are the conclusions you draw foolproof? Or are they mere hypotheses?
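The hints in the questions above mostly come down to timing individual operations and plotting the per-operation times. The Python sketch below shows one possible shape for such a harness; it is only a starting point, not the expected solution. The function names time_sequential_reads and time_synced_appends are invented for this example, and time.perf_counter_ns() is used in place of rdtsc. Note that nothing here clears the file cache, so reads of recently written data will probably be served from memory unless you unmount and remount first.

```python
import os
import time


def time_sequential_reads(path, read_size, count):
    """Return the elapsed nanoseconds of each of up to `count` sequential reads."""
    timings = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(count):
            start = time.perf_counter_ns()
            data = os.read(fd, read_size)
            timings.append(time.perf_counter_ns() - start)
            if len(data) < read_size:   # short read: reached end of file
                break
    finally:
        os.close(fd)
    return timings


def time_synced_appends(path, block_size, count):
    """Append `count` blocks, calling fsync after each; return ns per append."""
    timings = []
    block = b"x" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            start = time.perf_counter_ns()
            os.write(fd, block)
            os.fsync(fd)   # ask the OS to flush this file to disk before stopping the clock
            timings.append(time.perf_counter_ns() - start)
    finally:
        os.close(fd)
    return timings
```

Plotting the list returned by time_sequential_reads against the read index may show periodic slow reads that hint at block or prefetch boundaries; plotting the list from time_synced_appends against the file size may show a jump where extra metadata has to be written. Whether those features actually appear depends on the platform, so treat any such reading as a hypothesis to test.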
A major issue with any data collection is: how convincing are your numbers? How do you make them more convincing? How do you deal with experimental noise? etc. Use your common sense and be critical of your numbers -- do they really convince you that you know the answer?
One thing you will undoubtedly do is to use repetition to increase your confidence, i.e., you will take multiple measurements of an event, and compute (for example) an average over many runs instead of the result from just a single experiment. Be careful when computing averages over numbers -- make sure to always first look at all the data. If you don't, you might use an average where an average doesn't make sense.
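To make the point about averages concrete, here is a small Python sketch that summarizes a set of repeated measurements. The function name summarize and the sample values are hypothetical, chosen only to show how a single outlier (say, one read that missed the cache) can dominate the mean while leaving the median almost untouched.

```python
import statistics


def summarize(samples):
    """Return basic summary statistics for a list of repeated measurements."""
    ordered = sorted(samples)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
    }


# Hypothetical timings in microseconds: five fast runs and one slow outlier.
timings_us = [10, 11, 10, 12, 11, 500]
summary = summarize(timings_us)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Here the mean lands far above every one of the fast runs, so reporting it alone would describe no run that actually happened. Looking at the minimum, median, and maximum together (or simply plotting every point) makes it clear whether an average is a sensible summary.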
Title: The title should be descriptive and fit in one line across the page.
Author: Right under the title, this says who you are.
Abstract: This is the paper in brief and should state the basic contents and conclusions of the paper. The abstract is not the introduction to the paper (it should be shorter), but is a summary of everything. Read some of the abstracts of papers we've read for class to get a better idea. In general, the abstract is an advertisement that should draw the reader into your paper, without being misleading. It should be complete enough to understand what will be covered in the paper. This is a technical paper and not a mystery novel -- don't be afraid of giving away the ending!
Intro: A short overview of what you did, and what you learned. More motivation than the abstract, and more details. Again, make sure you include your main conclusions.
Methodology: How you measured what you measured. Include something about your timer accuracy here, as well as a description of the platform you are using to the level of detail such that someone else could reproduce the experiment elsewhere.
Results: This section should consist mainly of graphs, addressing each of the questions above. Make sure that graphs have axes labeled (including units). Also make sure to include the code snippets with each graph (or some rough description of them) so we have an idea what exactly you measured. Also, make sure to draw appropriate conclusions about each graph.
Conclusions: Summarize your conclusions here, and talk about what else you have learned in the process.
This paper should be at most 6 pages long (including everything), in 10 point or larger font, in double column format. In your write-up, you should not re-describe the assignment. Your paper must be written using proper English grammar and should have no spelling mistakes.
LaTeX: Use this for typesetting your document.
gnuplot: Use this for making graphs.
Check out ~remzi/public/Example/ for an example of how to use these tools. LaTeX is an excellent system for writing academic and scientific papers, and it is worth spending some time learning how to use it. Using gnuplot or something like it (e.g., Ploticus) also makes a lot of sense, as these tools produce nice encapsulated PostScript (EPS) files to use within LaTeX.