Project 6: Data Protection

Important Dates

Due: Tuesday 12/15/09
You can have a partner on this project, if you would like.
New: You must also send me a picture of yourself to get full credit for the project.

Objectives

There are four objectives to this assignment:

To learn more about file systems
To learn about checksums
To learn about RAID protection schemes
To learn how interposition works

Overview

Data on disk gets corrupted over time -- but in this project, you are going to do something about it. How? By implementing a file system layer that adds an extra level of redundancy to the data you store in your file system. Sound simple? Good! You have the right attitude.

How are you going to do it? Well, that's where things get interesting. You are going to develop a file system library that interposes on all important file-system related calls and uses checksums and parity to ensure a higher level of data protection.

Files should be treated as a sequence of 4 KB blocks. Then, for each file that you access, you should compute and store a parity block for that file. Hence, each file should have a 4 KB parity block associated with it. If a block is later determined to be bad (e.g., not accessible or corrupted), you should then be able to transparently reconstruct the block from the other blocks of the file plus the parity block.

This leaves a problem: how do we tell when a file block has gone bad? Checksums are the answer. Specifically, for each 4KB block of the file, you should compute and store an MD5 hash of that block. Later, when reading the file back, you should read the MD5 hash too, and compare it to the hash of the block you just read. If they don't match, you have a bad block, and hence you need to use the parity block to reconstruct this block. If you have more than one bad block, you are out of luck -- reads to the bad blocks should simply return an error (return -1 on read and set errno to EIO).
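The rule above (good block: return the data; exactly one bad block: reconstruct it; more than one: fail with EIO) can be sketched as a small decision function. This is only an illustration under stated assumptions: the names classify_block, count_bad_blocks, BLOCK_OK, and BLOCK_RECOVERABLE are made up here, and the digests are assumed to have already been computed and laid out back to back, 16 bytes per block.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DIGEST_SIZE 16  /* an MD5 digest is 128 bits = 16 bytes */

/* Hypothetical result codes for this sketch. */
enum block_status { BLOCK_OK = 0, BLOCK_RECOVERABLE = 1 };

/* Count the blocks whose freshly computed digest differs from the stored one.
 * Both arrays hold nblocks digests of DIGEST_SIZE bytes each, back to back. */
static size_t count_bad_blocks(const unsigned char *stored,
                               const unsigned char *computed, size_t nblocks)
{
    size_t bad = 0;
    for (size_t b = 0; b < nblocks; b++) {
        if (memcmp(stored + b * DIGEST_SIZE,
                   computed + b * DIGEST_SIZE, DIGEST_SIZE) != 0)
            bad++;
    }
    return bad;
}

/* Decide what a read of block idx should do.
 * Returns BLOCK_OK if the block's digest matches, BLOCK_RECOVERABLE if it is
 * the only bad block in the file (so parity can rebuild it), or -1 with
 * errno set to EIO if it is bad and at least one other block is bad too. */
int classify_block(size_t idx, const unsigned char *stored,
                   const unsigned char *computed, size_t nblocks)
{
    if (memcmp(stored + idx * DIGEST_SIZE,
               computed + idx * DIGEST_SIZE, DIGEST_SIZE) == 0)
        return BLOCK_OK;
    if (count_bad_blocks(stored, computed, nblocks) == 1)
        return BLOCK_RECOVERABLE;
    errno = EIO;
    return -1;
}
```

Note that in this sketch a read of a good block still succeeds even when other blocks in the same file are bad; only reads of the bad blocks fail.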

So far we haven't said how you are going to get access to those file system open(), read(), write(), and other relevant calls. One way would simply be to rewrite the file system itself -- but that is a lot of work, and requires anyone who wants this feature to use your new file system. Instead, you will be building a dynamically-linked library (again!) which interposes on important system calls in order to allow you to do the extra work you need to do. You will call this library “libfsprotect.so” and it will add checksums and parity to whatever file system you use it upon.

Checksums

For checksums you will be using MD5. MD5 takes an input string and gives you a 128-bit fingerprint or checksum as an output. The files md5.c and md5.h have all the code you need.

For more information about md5, you might check out the unofficial MD5 homepage.

Redundancy

Parity is easy to compute using xor. In C, bitwise xor can be achieved with the ^ operator, also known as a caret. In xor.c you will find a very simple example.
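To make the scheme from the Overview concrete, here is a minimal sketch (not the contents of xor.c) of computing a 4 KB parity block over a file's blocks and rebuilding one lost block from the survivors. The names BLOCK_SIZE, compute_parity, and reconstruct_block are illustrative assumptions, and the file is assumed to be held in memory as consecutive 4 KB blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096  /* files are treated as a sequence of 4 KB blocks */

/* XOR one block into another, byte by byte: dst ^= src. */
static void xor_block(unsigned char *dst, const unsigned char *src)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        dst[i] ^= src[i];
}

/* parity = block0 ^ block1 ^ ... ^ block(n-1).
 * data points at nblocks consecutive blocks of BLOCK_SIZE bytes. */
void compute_parity(const unsigned char *data, size_t nblocks,
                    unsigned char *parity)
{
    memset(parity, 0, BLOCK_SIZE);
    for (size_t b = 0; b < nblocks; b++)
        xor_block(parity, data + b * BLOCK_SIZE);
}

/* Rebuild block number bad from the parity block and every other block.
 * The (corrupted) contents of block bad are never read. */
void reconstruct_block(const unsigned char *data, size_t nblocks, size_t bad,
                       const unsigned char *parity, unsigned char *out)
{
    assert(bad < nblocks);
    memcpy(out, parity, BLOCK_SIZE);
    for (size_t b = 0; b < nblocks; b++) {
        if (b != bad)
            xor_block(out, data + b * BLOCK_SIZE);
    }
}
```

This works because x ^ x == 0: XORing the parity with all of the good blocks cancels them out and leaves only the missing block. It also shows why a single parity block can recover at most one bad block per file.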

Interposition with a Dynamically-Linked Library

As before, you will develop your project as a dynamically-linked library. In your library, you will need to define your implementation of various important file system APIs, such as open(), close(), lseek(), read(), and write(). Don't forget about unlink(), truncate(), and ftruncate() too!

To do this, you will use the LD_PRELOAD functionality provided by the dynamic linker (see the ld.so man page, "man ld.so", for more information). For example, let's say you wanted to be passed control every time the system call close() was called by a process.

You would first build your library to define its own close() routine. Here is a simple example.
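A minimal sketch of such a library, assuming all it needs to do is print the message seen in the sample run below, might look like this. It is not necessarily the course's own example file; the message format is copied from that sample output.

```c
#include <stdio.h>

/* Our own close(): because the library is preloaded, this definition is
 * found before the one in libc. Assumption: the message format mirrors the
 * "closing fd: 3" line in the sample output. */
int close(int fd)
{
    fprintf(stderr, "closing fd: %d\n", fd);
    return 0;  /* NOTE: the real close() is never called in this version */
}
```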

You then need to build the library, just like you would build a typical dynamically-linked library.

prompt> gcc -shared -o libfsprotect.so -fpic -Wall libfsprotect.c

To interpose on the close() routine, use LD_PRELOAD as follows:

prompt> setenv LD_PRELOAD ./libfsprotect.so

Then, simply run something that you know opens (and then closes) a file:

prompt> cat /dev/null
closing fd: 3
prompt>

You can also use ldd to find out which libraries an executable is currently linked with:

prompt> ldd /bin/cat
./libfsprotect.so => ./libfsprotect.so (0xb75e6000)
linux-gate.so.1 => (0x0028e000)
libc.so.6 => /lib/libc.so.6 (0x00101000)
/lib/ld-linux.so.2 (0x00321000)
prompt>

To turn off this interposition, simply unset the LD_PRELOAD variable:

prompt> unsetenv LD_PRELOAD

Note: if you use bash (not tcsh), the commands are slightly different: use "export LD_PRELOAD=./libfsprotect.so" to set LD_PRELOAD and "unset LD_PRELOAD" to unset it.

One major problem with this example is that it doesn't actually call the real close() routine! Oops -- that means a bunch of files are now left open. The difficulty here is that within your close() routine, you can't simply call close() again -- the compiler/linker would think you are recursively calling the close() routine you defined, and not the one you mean (the one in libc.so). Hence, you must explicitly look up the real close() by using dlopen() and dlsym() (see the man pages for more information).

Here is a simple example of a better version of close.

Also, make sure to link with libdl when you make your shared library.

prompt> gcc -shared -o libfsprotect.so -fpic -Wall libfsprotect.2.c -ldl

For your convenience, you can also add a constructor and destructor function to your dynamic library. The constructor is a function that is run when the library is loaded into a process but before any other routine is run, and the destructor gets called when the process is exiting. In this project, you might do things like call dlopen() in the constructor (and dlclose() in the destructor); any other initialization and tear-down code should go there too.

Here is a simple example of the close library with a constructor and destructor.

Interposing on All Relevant Functions

One problem with interposition is that for your library to work on all sorts of real programs, you need to interpose on all relevant interfaces. For real programs, that means not just open(), close(), lseek(), read(), write(), unlink(), truncate(), and ftruncate(), but other (often undocumented) interfaces such as open64(). For this project, start with what's simple: open(), close(), lseek(), read(), and write(). Worry about unlink() and truncate()/ftruncate() later.

From there, try to figure out what real applications use. Tools like strace can be very useful: run "strace program" on the command line to see which system calls program makes. The program objdump might also be handy for looking at the symbols in an executable (the -t flag prints the symbol table).

However, how much time you spend on this is up to you -- if you get everything working for the basic interfaces (open(), close(), lseek(), read(), write(), unlink(), ftruncate()/truncate()), that will be good enough for full credit. Doing the fuller interface just lets you run more standard programs on top of your library.

Note: One interface you don't have to deal with is mmap(). mmap() allows processes to map a file into their address space and then access it as if it were memory. Catching all reads and writes to a file thus becomes the task of catching all loads and stores to a memory region of a process, which is painful (at best). Hence, in this project, you don't have to handle mmap().

Managing your Meta-data

In this project, your libfsprotect will have to manage some data (e.g., checksums, parity) about the files it is protecting. We will refer to this extra information as the meta-data that libfsprotect is managing.

Your job is to figure out how to manage this meta-data in a simple and efficient manner. One thing that will be assumed is that the user has set an environment variable FSPROTECT_HOME to the directory where this meta-data should live. If this variable isn't set, you should print an error message and exit. Use getenv() and setenv() to access this variable, and store any relevant information that you need about the files you are protecting in this directory.
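The FSPROTECT_HOME check might be sketched as follows. The function names are made up for illustration, and treating an empty value the same as an unset one is an assumption, not something the project requires.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns the value of FSPROTECT_HOME, or NULL if it is unset.
 * Assumption: an empty string is treated the same as unset. */
const char *lookup_fsprotect_home(void)
{
    const char *home = getenv("FSPROTECT_HOME");
    return (home != NULL && home[0] != '\0') ? home : NULL;
}

/* Same lookup, but prints an error message and exits if the variable is
 * missing, as the project description asks. */
const char *require_fsprotect_home(void)
{
    const char *home = lookup_fsprotect_home();
    if (home == NULL) {
        fprintf(stderr, "libfsprotect: FSPROTECT_HOME is not set\n");
        exit(1);
    }
    return home;
}
```

A natural place to call require_fsprotect_home() would be the library's initialization code, so that the error shows up before any file is touched.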

Note: The way you manage your meta-data is one major design aspect of this project. You have complete freedom here, but use that freedom wisely. Think about what you need to store on disk in the FSPROTECT_HOME directory (parity, checksums) and then design a scheme that accomplishes that.
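As one possibility among many (purely an assumption, not a required design), each protected file could get a single meta-data file in FSPROTECT_HOME that holds the 4 KB parity block first, followed by one 16-byte MD5 digest per data block. The offsets into such a file are then simple arithmetic; all names below are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>

#define BLOCK_SIZE  4096  /* data and parity block size */
#define DIGEST_SIZE 16    /* MD5 digest: 128 bits */

/* Hypothetical layout of one meta-data file:
 *   [ parity block (4096 bytes) ][ digest 0 ][ digest 1 ] ...            */

/* Offset of the parity block within the meta-data file. */
off_t parity_offset(void)
{
    return 0;
}

/* Offset of the digest for data block number block_index. */
off_t digest_offset(size_t block_index)
{
    return (off_t)BLOCK_SIZE + (off_t)block_index * DIGEST_SIZE;
}

/* Which data block a (4 KB-aligned) file offset falls in. */
size_t block_index_for(off_t file_offset)
{
    return (size_t)(file_offset / BLOCK_SIZE);
}
```

With a fixed-size parity block at the front, the digest for any block can be read or rewritten with a single seek, which keeps the common read and write paths cheap.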

Bootstrapping: One problem you will encounter is that the proper meta-data (checksums, parity) for a file has not yet been created. In this case, when the file is accessed, you should go ahead and bootstrap the file; that is, you should compute the checksums and parity for the file, and store them properly in the FSPROTECT_HOME directory.

Other Notes

When can errors occur? It is best to assume that corruption errors can occur after an open() has taken place -- hence, read() may be called and require you to check checksums and use the parity block if need be. However, most of the tests will have the error occur before the given file is opened.

Simplification: For truncate() and ftruncate(), the file can only be shortened (assume it won't be lengthened), and it can only be shortened to a size that is a multiple of 4KB.

Complication: Realize that a file can be deleted or have its contents removed by calls such as unlink(), open() with the O_TRUNC flag, truncate(), and ftruncate(). Your library should handle such deletion and truncation correctly, so that the stored meta-data stays consistent with the file.

Simplification: To make your life easier, you don't have to handle files that are not a multiple of 4KB in size (though you may wish to add checks to make sure the files are as you expect). You also don't have to worry about reads and writes that are not 4KB-aligned -- in other words, all reads/writes to files that you are managing will be a multiple of 4KB in size and will start at an offset that is a multiple of 4KB. Both of these simplifications should remove a number of corner cases from your code.
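Because every write replaces whole 4KB blocks, the parity block does not have to be recomputed from scratch on each write. A standard XOR identity lets you update it from the old and new contents of the block being overwritten. The sketch below uses the made-up name update_parity and assumes the old block has already been verified against its checksum.

```c
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Incremental parity update for one overwritten block:
 *   new_parity = old_parity ^ old_block ^ new_block
 * XORing in old_block cancels its contribution; XORing in new_block adds
 * the replacement. Assumption: old_block is known to be good. */
void update_parity(unsigned char *parity, const unsigned char *old_block,
                   const unsigned char *new_block)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= old_block[i] ^ new_block[i];
}
```

The same identity covers truncation: removing a block from the file amounts to XORing its old contents out of the parity, which is the update above with new_block set to all zeros.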

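Dealing with open(): open() is a bit of a pain to deal with because it takes a variable number of arguments: the third argument (the mode) is only present when the file may be created (e.g., when O_CREAT is set). To handle it, retrieve the optional argument with the macros from <stdarg.h> (va_start(), va_arg(), and va_end()).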

Turning It In

Place the code and Makefile to build your library in the usual place. If you have a partner, please put the code in both places to make our lives easier. Thanks!