CS 537: Fall 2005

Programming Assignment 4: Protecting your data


Due: Thursday, December 15th, at 10pm
You are to do this project BY YOURSELF
This project must be implemented in C

Notes

12/15 -- Some tests to expect:

No file exists: use your library to ...
open 'foo' O_WRONLY, write out 1 block, close, open, read
open 'foo' O_WRONLY, write 1, close, corrupt 1 block, open, read
open 'foo' O_WRONLY, write 10, close, corrupt 1 in middle, open, read all
open 'foo' O_WRONLY, write 10, close, corrupt 1 in middle, open, lseek to "bad" block, read 1 block
open 'foo' O_WRONLY, write 10, close, corrupt 2 in middle, open, read all (should fail)
open 'foo' O_WRONLY, write 10, close, corrupt 2 in middle, open, lseek to "bad" block, read 1 block (should fail)
open/write/close 'foo' (size 10), open, lseek to middle block, write 1, close, open and read back all
open/write/close 'foo' (size 10), open, lseek to middle block, write 1, close, corrupt 1 other block, open and read back all

Files kept open a long time: use your library to
open 10 block file, corrupt 1 block, read it, close
open 10 block file, corrupt 3 blocks, read them (should fail), close
open 10 block file, corrupt 1 block, read it, corrupt another, close

Truncating/Unlinking: use your library to ...
open/write/close 'foo' size 10 blocks, unlink, try to open it (should fail)
open/write/close foo size 10, truncate to size 0, open and read all
open/write/close foo size 10, truncate to size 1 block, open and read all
open/write/close foo size 10, truncate to size 3 blocks, corrupt 1 block, open and read all
open/write/close foo size 10, truncate to size 3 blocks, corrupt 2 blocks, open and read all (should fail)

Scale tests: use your library to ...
open/write/close 100 files of size 1 block, read them all back
open/write/close 10000 files of size 1 block, read them all back
open/write/close 100 files of size 10000 blocks, read them all back
open 100 files, write 1 block to each, close 100 files, read them all back
open 1000 files, write 1 block to each, close 1000 files, read all back
create 10000 files of size 1 block (this time without your library), read them all (with your library)
open/write/close 10000 files of size 3 blocks, corrupt 1 block of each, read them all back
open/write/close 10000 files of size 3 blocks, corrupt 2 blocks of each, read them all back (should all fail)
open 1 10-block file 10 times, read same file through each descriptor
open 1 10-block file 10 times, read/write file through each descriptor

File 'foo' (size 10 blocks) exists but has never been accessed thru your library:
open 'foo', read all 1 block @ a time, close
open 'foo', read all, close, corrupt 1 block of 'foo', read all again
open 'foo', read all, close, corrupt 1 block, read just that block
open 'foo', read all, close, corrupt 2 blocks, try to read (should fail)
open 'foo', read all, 2 blocks @ a time

Left to you to test without guidance: error cases ...
e.g., open a file O_WRONLY (write only), try to read a block (all using your library), or having FSPROTECT_HOME not set, or ...

12/12 -- When can errors occur? It is best
to assume that corruption errors can occur after
an open() has taken place -- hence, read() may
be called and require you to check checksums and
use the parity block if need be. However, most
of the tests will have the error occur before the
given file is opened.

12/6 -- One more simplification: For truncate() and
ftruncate(), the file can only be shortened (assume it won't
be lengthened), and it can only be shortened to a size that
is a multiple of 4KB.

12/5 -- Complications: Realize that a file's data can be removed,
in whole or in part, by unlink(), by open() with the O_TRUNC flag,
and by truncate() and ftruncate(). Your library should handle such
deletion correctly.

12/5 -- Simplifications: To make your life easier, you don't
have to handle files that are not a multiple of 4KB in size
(though you may wish to add checks to make sure the files are
as you expect). You also don't have to worry about reads and
writes that are not 4KB-aligned -- in other words, all reads/writes
to files that you are managing will be a multiple of 4KB in
size and will start at an offset that is a multiple of 4KB. Both of
these simplifications should remove a number of corner cases
from your code.
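
If you do decide to add such checks, they can be very small. The sketch below is one possible shape; the names FSP_BLOCK_SIZE, fsp_is_block_aligned() and fsp_block_index() are made up for illustration and are not part of the assignment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define FSP_BLOCK_SIZE 4096   /* 4 KB blocks, as in the assignment */

/* True if both the starting offset and the length are whole blocks. */
static bool fsp_is_block_aligned(off_t offset, size_t len)
{
    return offset >= 0 &&
           offset % FSP_BLOCK_SIZE == 0 &&
           len % FSP_BLOCK_SIZE == 0;
}

/* Index of the block that starts at a block-aligned offset. */
static size_t fsp_block_index(off_t offset)
{
    return (size_t)(offset / FSP_BLOCK_SIZE);
}
```

A read() or write() wrapper could call fsp_is_block_aligned() first and refuse (or pass straight through) any request that fails the check.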

12/1 -- Dealing with open(): Open is a bit of a pain to deal
with because it takes a variable number of arguments. To
handle it, you should check out the following code example
here.
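
The linked example is not reproduced in this text. As a rough sketch of the idea (not necessarily what that example does), the wrapper below reads the optional third argument with va_arg only when O_CREAT is set, then forwards the call to the real open(). Finding libc under the name "libc.so.6" is an assumption about a glibc-based Linux machine, and lookup_real_open() is a made-up helper name.

```c
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stddef.h>
#include <sys/types.h>

typedef int (*open_fn)(const char *, int, ...);

/* Look up libc's open() once. "libc.so.6" is an assumption (glibc naming). */
static open_fn lookup_real_open(void)
{
    static open_fn real_open = NULL;

    if (real_open == NULL) {
        void *libc = dlopen("libc.so.6", RTLD_LAZY);
        if (libc != NULL)
            real_open = (open_fn)dlsym(libc, "open");
    }
    return real_open;
}

int open(const char *path, int flags, ...)
{
    mode_t mode = 0;

    /* The third argument is only present when the caller may create a file. */
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        /* mode_t may be narrower than int, so read the promoted type. */
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }

    open_fn real_open = lookup_real_open();
    if (real_open == NULL) {
        errno = ENOSYS;
        return -1;
    }

    /* A real library would look up or bootstrap the file's meta-data here. */
    return real_open(path, flags, mode);
}
```

Passing a mode of 0 when O_CREAT is absent is harmless, since the real open() ignores the argument in that case.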

Objectives

There are four objectives to this assignment:
To learn more about file systems
To learn about checksums
To learn about RAID protection schemes
To learn how interposition works

1.0 Overview

Data on disk gets corrupted over time -- but in this project, you are
going to do something about it. How? By implementing a file system layer
that adds an extra level of redundancy to the data you store in your file
system. Sound simple? Good! You have the right attitude.

How are you going to do it? Well, that's where things get interesting.
You are going to develop a file system library that interposes on
all important file-system related calls and uses checksums and parity
to ensure a higher level of data protection.

Files should be treated as a sequence of 4 KB blocks. Then, for each file
that you access, you should compute and store a parity block for that file.
Hence, each file should have a 4 KB parity block associated with it. If a
block is later determined to be "bad" (e.g., not accessible or corrupted),
you should then be able to transparently "reconstruct" the block from the
other blocks of the file plus the parity block.

This leaves a problem: how do we tell when a file block has gone bad?
Checksums are the answer. Specifically, for each 4KB block of the file,
you should compute and store an MD5 hash of that block. Later, when reading
the file back, you should read the MD5 hash too, and compare it to the hash
of the block you just read. If they don't match, you have a "bad" block,
and hence you need to use the parity block to reconstruct this block. If
you have more than one bad block, you are out of luck -- reads to the bad
blocks should simply return an error (return -1 on read and set errno to EIO).
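
To make the decision logic concrete, here is a minimal sketch that scans a file's blocks against their stored checksums and reports whether zero, one, or several blocks are bad. Everything named here is an assumption for illustration: the digest_fn signature is hypothetical (you would wrap whatever md5.c provides to fit it), and toy_digest() is only a placeholder so the sketch compiles on its own. It is NOT MD5.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define SUM_SIZE   16      /* an MD5 digest is 128 bits */

/* Hypothetical digest signature; wrap the routines from md5.c to match it. */
typedef void (*digest_fn)(const unsigned char *buf, size_t len,
                          unsigned char out[SUM_SIZE]);

/* Placeholder so the sketch is self-contained. This is NOT MD5. */
static void toy_digest(const unsigned char *buf, size_t len,
                       unsigned char out[SUM_SIZE])
{
    unsigned long h = 5381;

    for (size_t i = 0; i < len; i++)
        h = h * 33 + buf[i];
    for (size_t i = 0; i < SUM_SIZE; i++) {
        out[i] = (unsigned char)(h & 0xff);
        h = h * 33 + i;
    }
}

/*
 * Compare every block against its stored checksum.
 * Returns -1 if all blocks are good, the index of the bad block if exactly
 * one is bad, or -2 with errno set to EIO if more than one is bad.
 */
static long find_bad_block(const unsigned char *data, size_t nblocks,
                           const unsigned char *sums, digest_fn digest)
{
    long bad = -1;
    unsigned char now[SUM_SIZE];

    for (size_t i = 0; i < nblocks; i++) {
        digest(data + i * BLOCK_SIZE, BLOCK_SIZE, now);
        if (memcmp(now, sums + i * SUM_SIZE, SUM_SIZE) != 0) {
            if (bad >= 0) {
                errno = EIO;      /* two bad blocks: parity cannot help */
                return -2;
            }
            bad = (long)i;
        }
    }
    return bad;
}
```

With a result of -2, a read() of any of the bad blocks would return -1 with errno set to EIO; with a single bad index, the block can be rebuilt from the parity block.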

So far we haven't said how you are going to get access to those file system
open(), read(), write(), and other relevant calls. One way would simply be
to rewrite the file system itself -- but that is a lot of work, and
requires anyone who wants this feature to use your new file system. Instead,
you will be building a dynamically-linked library (again!) which interposes
on important system calls in order to allow you to do the extra work you
need to do. You will call this library "libfsprotect.so" and it will add
checksums and parity to whatever file system you use it upon.

That's about it! Now, for some details.

2.0 Checksums

For checksums you will be using MD5. MD5 takes an input string and gives you a
128-bit "fingerprint" or checksum as an output. The files md5.c
and md5.h have all the code you need.

For more information about MD5, you might check out the following:

The unofficial MD5 homepage
Source code example (as above)

3.0 Redundancy

Parity is easy to compute using xor. In C, bitwise xor can be achieved
with the ^ operator, also known as a "caret". In xor.c
you will find a very simple example.
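
For a slightly larger picture than a single ^, here is a sketch of the three parity operations a library like this is likely to need: computing the parity block, rebuilding one lost block, and updating the parity after a block is overwritten. The function names and the layout (all blocks held back to back in one buffer) are assumptions made for the example, not requirements.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* parity = XOR of all blocks; blocks are stored back to back in data. */
static void compute_parity(const unsigned char *data, size_t nblocks,
                           unsigned char parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (size_t b = 0; b < nblocks; b++)
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[b * BLOCK_SIZE + i];
}

/* Rebuild block `bad` in place: XOR the parity with every other block. */
static void reconstruct_block(unsigned char *data, size_t nblocks, size_t bad,
                              const unsigned char parity[BLOCK_SIZE])
{
    unsigned char *out = data + bad * BLOCK_SIZE;

    memcpy(out, parity, BLOCK_SIZE);
    for (size_t b = 0; b < nblocks; b++) {
        if (b == bad)
            continue;
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            out[i] ^= data[b * BLOCK_SIZE + i];
    }
}

/* After overwriting one block, fold the change into the parity in place. */
static void update_parity(unsigned char parity[BLOCK_SIZE],
                          const unsigned char *old_block,
                          const unsigned char *new_block)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= old_block[i] ^ new_block[i];
}
```

The update form works because xor-ing the old contents into the parity cancels them out, so the other blocks of the file never have to be re-read on a write.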

4.0 Interposition with a Dynamically-Linked Library

As before, you will develop your project as a dynamically-linked library.
In your library, you will need to define your implementation of various
important file system APIs, such as open(), close(), lseek(), read(), and
write(). Don't forget about unlink(), truncate(), and ftruncate() too!

To do this, you will use the LD_PRELOAD functionality provided by the
dynamic linker (see "man ld.so" for more information).
For example, let's say you wanted to be passed control every time
the system call "close" was called by a process.

You would first build your library to define its own close() routine.
Here is a simple example.
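
The linked file is not reproduced in this text. A minimal sketch of such a routine might look like the following; the message format is chosen to match the "closing fd: 3" output in the transcript in this section, and printing to stderr is an assumption.

```c
#include <stdio.h>
#include <unistd.h>

/* Deliberately incomplete: reports the call but never closes the descriptor. */
int close(int fd)
{
    fprintf(stderr, "closing fd: %d\n", fd);
    return 0;
}
```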

You then need to build the library, just like you would build a typical
dynamically-linked library.

prompt> gcc -shared -o libfsprotect.so -fpic -Wall libfsprotect.c

To "interpose" on the close() routine, use LD_PRELOAD as follows:

prompt> setenv LD_PRELOAD "./libfsprotect.so"

Then, simply run something that you know opens (and then closes)
a file:

prompt> cat /dev/null
closing fd: 3
prompt>

You can also use "ldd" to find out which libraries an executable
is currently linked with:

prompt> ldd /s/std/bin/cat
./libfsprotect.so => ./libfsprotect.so (0xb75e6000)
libc.so.6 => /lib/tls/libc.so.6 (0xb749a000)
libdl.so.2 => /lib/libdl.so.2 (0xb7497000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0xb75e9000)
prompt>

To turn off this interposition, simply unset the LD_PRELOAD variable:

prompt> unsetenv LD_PRELOAD

One major problem with this example is that it doesn't actually call
the real close() routine! Oops -- that means a bunch of files are
now left open. The difficulty here is that within your close() routine,
you can't simply call close() again -- the compiler/linker would think
you are recursively calling the close() routine you defined, and not the
one you mean (the one in libc.so). Hence, you must explicitly link
with the real close() by using dlopen() and dlsym() (see the man pages
for more information).

Here is a simple example of a better version of close.
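
The linked file is not reproduced in this text. As a rough sketch of what a better version might look like, the routine below opens libc with dlopen(), finds the real close() with dlsym(), and calls it through a function pointer. The library name "libc.so.6" is an assumption (it is the name that shows up in the ldd listing above), and caching the pointer in a static variable is just one way to avoid repeating the lookup.

```c
#include <dlfcn.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

typedef int (*close_fn)(int);

int close(int fd)
{
    static close_fn real_close = NULL;   /* resolved on the first call */

    if (real_close == NULL) {
        /* "libc.so.6" is an assumption: the glibc name seen in ldd output. */
        void *libc = dlopen("libc.so.6", RTLD_LAZY);
        if (libc != NULL)
            real_close = (close_fn)dlsym(libc, "close");
        if (real_close == NULL) {
            errno = ENOSYS;
            return -1;
        }
    }

    fprintf(stderr, "closing fd: %d\n", fd);
    return real_close(fd);               /* this time the file really closes */
}
```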

Also, make sure to link with libdl when you make your shared library.

prompt> gcc -shared -o libfsprotect.so -fpic -Wall libfsprotect.2.c -ldl

For your convenience, you can also add a "constructor" and "destructor"
function to your dynamic library. The constructor is a function that is
run when the library is loaded into a process but before any other routine
is run, and the destructor gets called when the process is exiting.
In this project, you might do things like call dlopen() in the constructor
(and dlclose() in the destructor); any other initialization and tear-down
code should go there too.

Here is a simple example of the close library
with a constructor and destructor.
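
The linked file is not reproduced in this text. One way to write it with gcc is the __attribute__((constructor)) and __attribute__((destructor)) extensions, as in the sketch below. The names fsprotect_init() and fsprotect_fini() are made up, "libc.so.6" is again an assumed glibc name, and the fallback call in close() is a precaution in case some other library calls close() before the constructor has run.

```c
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

static void *libc_handle = NULL;
static int (*real_close)(int) = NULL;

/* Runs when the library is loaded, before main(). */
__attribute__((constructor))
static void fsprotect_init(void)
{
    if (libc_handle == NULL)
        libc_handle = dlopen("libc.so.6", RTLD_LAZY);  /* assumed glibc name */
    if (libc_handle != NULL && real_close == NULL)
        real_close = (int (*)(int))dlsym(libc_handle, "close");
    /* Other set-up (e.g. checking FSPROTECT_HOME) would go here. */
}

/* Runs when the process exits normally. */
__attribute__((destructor))
static void fsprotect_fini(void)
{
    /* Write back any cached meta-data here, then drop the libc handle. */
    if (libc_handle != NULL) {
        dlclose(libc_handle);
        libc_handle = NULL;
    }
}

int close(int fd)
{
    if (real_close == NULL)
        fsprotect_init();     /* close() was called before the constructor */
    if (real_close == NULL) {
        errno = ENOSYS;
        return -1;
    }
    return real_close(fd);
}
```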

5.0 Interposing on All Relevant Functions

One problem with interposition is that for your library to work on all sorts
of real programs, you need to interpose on all relevant interfaces. For real
programs, that means not just open(), close(), lseek(), read(), write(),
unlink(), truncate(), and ftruncate(), but other (often undocumented)
interfaces such as open64(). For this project, start with what's simple:
open(), close(), lseek(), read(), and write(). Worry about unlink() and
truncate()/ftruncate() later.

From there, try to figure out what real applications use. Tools like
strace (run "strace program" on the command line to see what system
calls "program" calls) can be very useful. Also, the program "objdump"
can help you look for symbols in the executable (try the -t flag).

However, how much time you spend on this is up to you -- if you get
everything working for the basic interfaces (open(), close(), lseek(),
read(), write(), unlink(), ftruncate()/truncate()), that will be
good enough for full credit. Doing the fuller interface just
lets you run more standard programs on top of your library.

Note: One interface you don't have to deal with is mmap().
mmap() allows processes to map a file into their address space
and then access it as if it were memory. Catching all reads and
writes to a file thus becomes the task of catching all loads and
stores to a memory region of a process, which is painful (at best).
Hence, in this project, you don't have to handle mmap().

6.0 Managing your Meta-data

In this project, your libfsprotect will have to manage some data
(e.g., checksums, parity) about the files it is protecting. We will
refer to this extra information as the "meta-data" that libfsprotect
is managing.

Your job is to figure out how to manage this meta-data in a simple
and efficient manner. One thing that will be assumed is that the user
has set an environment variable FSPROTECT_HOME to the directory where
this metadata should live. If this variable isn't set, you should
print an error message and exit. Use getenv() and setenv() to access
this variable, and store any relevant information that you need
about the files you are protecting in this directory.
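
As a starting point, the sketch below checks FSPROTECT_HOME and maps a protected file's path to a meta-data file name inside that directory. The naming scheme is purely hypothetical (one ".meta" file per protected file, with '/' and '%' escaped so that two different paths never collide); the layout is yours to design, and a real scheme would also have to decide how to treat relative versus absolute paths.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return FSPROTECT_HOME, or print an error and exit if it is not set. */
static const char *fsp_home(void)
{
    const char *home = getenv("FSPROTECT_HOME");

    if (home == NULL || home[0] == '\0') {
        fprintf(stderr, "libfsprotect: FSPROTECT_HOME is not set\n");
        exit(1);
    }
    return home;
}

/*
 * Hypothetical naming scheme: <home>/<encoded path>.meta, where '/' and '%'
 * in the protected file's path are written as %2F and %25.
 * Returns 0 on success, or -1 if out is too small.
 */
static int fsp_meta_path(const char *home, const char *path,
                         char *out, size_t outlen)
{
    size_t n = strlen(home);

    if (n + 1 >= outlen)
        return -1;
    memcpy(out, home, n);
    out[n++] = '/';

    for (; *path != '\0'; path++) {
        if (n + 4 >= outlen)          /* room for "%XX" plus a NUL */
            return -1;
        if (*path == '/' || *path == '%')
            n += (size_t)sprintf(out + n, "%%%02X", (unsigned char)*path);
        else
            out[n++] = *path;
    }
    if (n + 6 > outlen)               /* ".meta" plus the terminating NUL */
        return -1;
    memcpy(out + n, ".meta", 6);
    return 0;
}
```

For example, with FSPROTECT_HOME set to /tmp/meta, the file dir/foo would map to /tmp/meta/dir%2Ffoo.meta under this scheme.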

Note: The way you manage your meta-data is one major design
aspect of this project. You have complete freedom here, but use that
freedom wisely. Think about what you need to store on disk in the
FSPROTECT_HOME directory (parity, checksums) and then design a scheme
that accomplishes that.

Bootstrapping: One problem you will encounter is that the proper
meta-data (checksums, parity) for a file has not yet been created. In
this case, when the file is accessed, you should go ahead and "bootstrap"
the file -- that is, you should compute the checksums and parity for
the file, and store them properly in the FSPROTECT_HOME directory.

7.0 Extra Credit

If you do all of the above, you will receive full credit. However,
there are a couple of extra issues you can work on if you wish to
receive some extra credit for this assignment.

Dealing with Multiple Processes: If multiple processes are
accessing protected data at the same time, there is a chance that
multiple concurrent updates will occur to your meta-data. Such
a race condition could lead to inconsistent meta-data, which would
then perhaps lead to corrupt data being given back to the user.
Design your system with file locking to avoid this problem.
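
One possible building block is an advisory lock on the meta-data file, held for the duration of each update. The sketch below uses flock(); fcntl() record locks are another option. The helper names are made up, and it assumes every process that touches the meta-data goes through the same two calls, since advisory locks do nothing against a process that ignores them.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/file.h>

/* Block until this process holds an exclusive lock on the meta-data file. */
static int meta_lock(int meta_fd)
{
    int rc;

    do {
        rc = flock(meta_fd, LOCK_EX);
    } while (rc == -1 && errno == EINTR);   /* retry if a signal interrupts */
    return rc;
}

/* Release the lock so other processes can update the meta-data. */
static int meta_unlock(int meta_fd)
{
    return flock(meta_fd, LOCK_UN);
}
```

A write path would then look like: meta_lock(), update the data block, update the checksum and parity, meta_unlock().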

Dealing with Crashes: If a process using your library crashes
while updating data but (for example) before updating the checksums
or parity information, you could have an on-disk consistency problem,
which again could lead to the user receiving bad data. Design your
system to handle the case where a crash occurs in an untimely manner.
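
One common ingredient, shown here only as a sketch, is to never overwrite meta-data in place: write the new version to a temporary file, force it to disk, and rename() it over the old one. This assumes the meta-data for a file lives in a single file of its own, and the name meta_replace() is made up. It only keeps the meta-data file itself from being half-written; ordering the update against the data write is a separate problem you would still need to solve. Inside your library, the calls below would also have to reach the real libc routines rather than your own interposed versions.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Write the new meta-data to "<path>.tmp", force it to disk, then rename it
 * over the old file. rename() replaces the target atomically, so after a
 * crash the file holds either the old contents or the new, never a mix.
 */
static int meta_replace(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;

    if (fwrite(buf, 1, len, f) != len || fflush(f) != 0 ||
        fsync(fileno(f)) != 0) {
        fclose(f);
        unlink(tmp);
        return -1;
    }
    if (fclose(f) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```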

8.0 Turning It In

Please tell me you know how to turn the project in by now!