CS 537
Lecture Notes, Part 11
More About File Systems


Previous File Systems
Next Security
Contents


This web page extends the previous page with more information about the implementation of file systems.

Long File Names

The Unix implementation described previously allows arbitrarily long path names for files, but each component is limited in length. In the original Unix implementation, each directory entry is 16 bytes long: two bytes for the inumber and 14 bytes for a path name component.1


    class Dirent {
        public short inumber;
        public byte name[14];
    }
If the name is less than 14 characters long, trailing bytes are filled with nulls (bytes with all bits set to zero--not to be confused with ‘0’ characters). An inumber of zero is used to mark an entry as unused (inumbers for files start at 1).
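
To make the lookup operation concrete, here is a minimal sketch (not the actual Unix code) of searching a directory stored as an array of these fixed-size entries. It assumes the directory's contents have already been read into a byte array, and it assumes a particular byte order for the inumber; a real implementation would simply use the machine's native representation.

    class DirSearch {
        // Return the inumber of 'name' in the directory data, or 0 if not found.
        // 'data' holds the raw bytes of the directory file; each entry is 16
        // bytes: a 2-byte inumber followed by a 14-byte null-padded name.
        // (The big-endian inumber here is an assumption for this sketch.)
        static int lookup(byte[] data, String name) {
            byte[] target = new byte[14];          // null-padded copy of the name
            byte[] nameBytes = name.getBytes();
            if (nameBytes.length > 14)
                return 0;                          // too long to be stored at all
            System.arraycopy(nameBytes, 0, target, 0, nameBytes.length);
            for (int off = 0; off + 16 <= data.length; off += 16) {
                int inumber = ((data[off] & 0xff) << 8) | (data[off + 1] & 0xff);
                if (inumber == 0)
                    continue;                      // unused entry
                boolean match = true;
                for (int i = 0; i < 14; i++) {
                    if (data[off + 2 + i] != target[i]) {
                        match = false;
                        break;
                    }
                }
                if (match)
                    return inumber;
            }
            return 0;
        }
    }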

This representation has one advantage: it is simple. Because every entry is exactly 16 bytes long, a directory can be treated as an array of fixed-size entries, and the code to search or update it is trivial.

However, it has several disadvantages: among other things, file names are limited to 14 characters, the two-byte inumber limits the file system to 65,535 files, and looking up a name requires a linear search of the directory. The people at Berkeley, while they were rewriting the file system code to make it faster, also changed the format of directories to get rid of the first two problems (they left the remaining problems unfixed). This new organization has been adopted by many (but not all) versions of Unix introduced since then.

The new format of a directory entry looks like this:2


    class DirentLong {
        int inumber;
        short reclen;
        short namelen;
        byte name[];
    }
The inumber field is now a 4-byte (32-bit) integer, so that a disk can have up to 4,294,967,296 files. The reclen field indicates the entire length of the DirentLong entry, including the 8-byte header. The actual length of the name array is thus reclen - 8 bytes. The namelen field indicates the length of the name. The remaining space in the name array is unused. This extra padding at the end of the entry serves three purposes: it allows each entry to be rounded up to a multiple of four bytes so that the integer fields are properly aligned, it allows an entry to be deleted simply by adding its space to the padding of the preceding entry, and it often leaves enough room to insert a new name without shuffling the rest of the directory around.

This approach has two very minor additional benefits over the old scheme. In the old scheme, every entry is 16 bytes, even if the name is only one byte long. In the new scheme, a name uses only as much space as it needs (although this doesn't save much, since the minimum size of an entry in the new scheme is 9 bytes--12 if padding is used to align entries to integer boundaries). The new approach also allows nulls to appear in file names, but other parts of the system make that impractical, and besides, who cares?
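
To see how the reclen and namelen fields are used, here is a hedged sketch of scanning one directory block of these variable-length entries. The field layout follows the description above; the little-endian byte order and the helper methods are assumptions made just for this example, not part of any real kernel.

    class DirScan {
        // Walk a directory block of variable-length entries, printing each name.
        // Layout assumed: 4-byte inumber, 2-byte reclen, 2-byte namelen, name.
        static void list(byte[] block) {
            int off = 0;
            while (off + 8 <= block.length) {
                int inumber = readInt(block, off);
                int reclen  = readShort(block, off + 4);
                int namelen = readShort(block, off + 6);
                if (reclen < 8)
                    break;                       // corrupt entry; stop scanning
                if (inumber != 0) {              // inumber 0 still marks a free entry
                    String name = new String(block, off + 8, namelen);
                    System.out.println(inumber + " " + name);
                }
                off += reclen;                   // reclen skips over the padding too
            }
        }
        static int readShort(byte[] b, int off) { // little-endian, for this sketch
            return (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8);
        }
        static int readInt(byte[] b, int off) {
            return readShort(b, off) | (readShort(b, off + 2) << 16);
        }
    }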

Space Management

Block Size and Extents

All of the file organizations I've mentioned store the contents of a file in a set of disk blocks. How big should a block be? The problem with small blocks is I/O overhead. There is a certain overhead to read or write a block beyond the time to actually transfer the bytes. If we double the block size, a typical file will have half as many blocks. Reading or writing the whole file will transfer the same amount of data, but it will involve half as many disk I/O operations. The overhead for an I/O operation includes a variable amount of latency (seek time and rotational delay) that depends on how close the blocks are to each other, as well as a fixed overhead to start each operation and respond to the interrupt when it completes.
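
As a back-of-the-envelope illustration of this tradeoff, the following little program computes the time to read a file as a fixed per-operation overhead times the number of I/O operations, plus the transfer time. The specific numbers are invented for the example, not measurements.

    class BlockSizeDemo {
        public static void main(String[] args) {
            double fileBytes = 1_000_000;       // a 1 MB file (assumed)
            double overheadMs = 10.0;           // seek + rotation + fixed cost per I/O (assumed)
            double bandwidth = 10_000_000;      // 10 MB/s transfer rate (assumed)
            for (int blockSize = 512; blockSize <= 8192; blockSize *= 2) {
                double ops = Math.ceil(fileBytes / blockSize);
                double transferMs = 1000.0 * fileBytes / bandwidth;
                double totalMs = ops * overheadMs + transferMs;
                System.out.printf("block %5d: %5.0f I/Os, %6.0f ms total%n",
                        blockSize, ops, totalMs);
            }
        }
    }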

Many years ago, researchers at the University of California at Berkeley studied the original Unix file system. They found that when they tried reading or writing a single very large file sequentially, they were getting only about 2% of the potential speed of the disk. In other words, it took about 50 times as long to read the whole file as it would if they simply read that many sequential blocks directly from the raw disk (with no file system software). They tried doubling the block size (from 512 bytes to 1K) and the performance more than doubled! The reason the speed more than doubled was that it took less than half as many I/O operations to read the file. Because the blocks were twice as large, twice as much of the file's data was in blocks pointed to directly by the inode. Indirect blocks were twice as large as well, so they could hold twice as many pointers. Thus four times as much data could be accessed through the singly indirect block without resorting to the doubly indirect block.

If doubling the block size more than doubled performance, why stop there? Why didn't the Berkeley folks make the blocks even bigger? The problem with big blocks is internal fragmentation. A file can only grow in increments of whole blocks. If the sizes of files are random, we would expect on the average that half of the last block of a file is wasted. If most files are many blocks long, the relative amount of waste is small, but if the block size is large compared to the size of a typical file, half a block per file is significant. In fact, if files are very small (compared to the block size), the problem is even worse. If, for example, we choose a block size of 8k and the average file is only 1K bytes long, we would be wasting about 7/8 of the disk.

Most files in a typical Unix system are very small. The Berkeley researchers made a list of the sizes of all files on a typical disk and did some calculations of how much space would be wasted by various block sizes. Simply rounding the size of each file up to a multiple of 512 bytes resulted in wasting 4.2% of the space. Including overhead for inodes and indirect blocks, the original 512-byte file system had a total space overhead of 6.9%. Changing to 1K blocks raised the overhead to 11.8%. With 2k blocks, the overhead would be 22.4% and with 4k blocks it would be 45.6%. Would 4k blocks be worthwhile? The answer depends on economics. In those days disks were very expensive, and wasting half the disk seemed extreme. These days, disks are cheap, and for many applications people would be happy to pay twice as much per byte of disk space to get a disk that was twice as fast.

But there's more to the story. The Berkeley researchers came up with the idea of breaking up the disk into blocks and fragments. For example, they might use a block size of 2k and a fragment size of 512 bytes. Each file is stored in some number of whole blocks plus 0 to 3 fragments at the end. The fragments at the end of one file can share a block with fragments of other files. The problem is that when we want to append to a file, there may not be any space left in the block that holds its last fragment. In that case, the Berkeley file system copies the fragments to a new (empty) block. A file that grows a little at a time may require each of its fragments to be copied many times. They got around this problem by modifying application programs to buffer their data internally and add it to a file a whole block's worth at a time. In fact, most programs already used library routines to buffer their output (to cut down on the number of system calls), so all they had to do was to modify those library routines to use a larger buffer size. This approach has been adopted by many modern variants of Unix. The Solaris system you are using for this course uses 8k blocks and 1K fragments.
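
To make the space accounting concrete, here is a tiny illustrative calculation (not the actual Berkeley allocator) of how much disk space a file of a given size consumes under the 2K-block, 512-byte-fragment scheme described above.

    class FragDemo {
        static final int BLOCK = 2048, FRAG = 512;   // sizes from the example above
        // Return the bytes of disk space consumed by a file of the given size,
        // assuming whole blocks plus just enough trailing fragments.
        static int spaceUsed(int fileSize) {
            int wholeBlocks = fileSize / BLOCK;
            int tail = fileSize % BLOCK;
            int frags = (tail + FRAG - 1) / FRAG;    // 0 to 3 fragments
            return wholeBlocks * BLOCK + frags * FRAG;
        }
        public static void main(String[] args) {
            int[] sizes = { 100, 2048, 2100, 5000 };
            for (int s : sizes)
                System.out.println(s + " bytes -> " + spaceUsed(s) + " bytes on disk");
        }
    }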

As disks get cheaper and CPUs get faster, wasted space is less of a problem and the speed mismatch between the CPU and the disk gets worse. Thus the trend is towards larger and larger disk blocks.

At first glance it would appear that the OS designer has no say in how big a block is. Any particular disk drive has a sector size, usually 512 bytes, wired in. But it is possible to use larger “blocks”. For example, if we think it would be a good idea to use 2K blocks, we can group together each run of four consecutive sectors and call it a block. In fact, it would even be possible to use variable-sized “blocks,” so long as each one is a multiple of the sector size. A variable-sized “block” is called an extent. When extents are used, they are usually used in addition to multi-sector blocks. For example, a system may use 2k blocks, each consisting of 4 consecutive sectors, and then group them into extents of 1 to 10 blocks. When a file is opened for writing, it grows by adding an extent at a time. When it is closed, the unused blocks at the end of the last extent are returned to the system. The problem with extents is that they introduce all the problems of external fragmentation that we saw in the context of main memory allocation. Extents are generally only used in systems such as databases, where high-speed access to very large files is important.
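
To make the sector/block/extent distinction concrete, here is an illustrative sketch (it does not correspond to any particular file system) that translates a byte offset within a file into an absolute sector number, given the file's list of extents. It assumes the 2K blocks of four consecutive 512-byte sectors from the example above.

    class ExtentMap {
        static final int SECTOR = 512, SECTORS_PER_BLOCK = 4;
        static final int BLOCK = SECTOR * SECTORS_PER_BLOCK;     // 2K blocks
        // An extent: 'length' consecutive blocks starting at disk block 'start'.
        static class Extent {
            int start, length;
            Extent(int start, int length) { this.start = start; this.length = length; }
        }
        // Translate a byte offset within the file to an absolute sector number.
        static long sectorOf(Extent[] extents, long offset) {
            long blockInFile = offset / BLOCK;
            for (Extent e : extents) {
                if (blockInFile < e.length) {
                    long diskBlock = e.start + blockInFile;
                    return diskBlock * SECTORS_PER_BLOCK + (offset % BLOCK) / SECTOR;
                }
                blockInFile -= e.length;        // skip past this extent
            }
            throw new IllegalArgumentException("offset past end of file");
        }
        public static void main(String[] args) {
            Extent[] file = { new Extent(1000, 10), new Extent(5000, 3) };
            System.out.println(sectorOf(file, 0));          // first sector of block 1000
            System.out.println(sectorOf(file, 10 * BLOCK)); // first sector of block 5000
        }
    }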

Free Space

[ Silberschatz, Galvin, and Gagne, Section 11.7 ]

We have seen how to keep track of the blocks in each file. How do we keep track of the free blocks--blocks that are not in any file? There are two basic approaches: keep a bitmap with one bit for each block on the disk, indicating whether that block is free, or link the free blocks together into a free list (for example, by having each free block hold the block numbers of other free blocks).

How do these methods compare? Neither requires significant space overhead on disk. The bitmap approach needs one bit for each block. Even for a tiny block size of 512 bytes, each bit of the bitmap describes 512*8 = 4096 bits of free space, so the overhead is less than 1/40 of 1%. The free list is even better. All the pointers are stored in blocks that are free anyhow, so there is no space overhead (except for one pointer to the head of the list). Another way of looking at this is that when the disk is full (which is the only time we should be worried about space overhead!) the free list is empty, so it takes up no space.

The real advantage of bitmaps over free lists is that they give the space allocator more control over which block is allocated to which file. Since the blocks of a file are generally accessed together, we would like them to be near each other on disk. To ensure this clustering, when we add a block to a file we would like to choose a free block that is near the other blocks of the file. With a bitmap, we can search the bitmap for an appropriate block. With a free list, we would have to search the free list on disk, which is clearly impractical. Of course, to search the bitmap, we have to have it all in memory, but since the bitmap is so tiny relative to the size of the disk, it is not unreasonable to keep the entire bitmap in memory all the time. To do the comparable operation with a free list, we would need to keep the block numbers of all free blocks in memory. If a block number is four bytes (32 bits), that means that 32 times as much memory would be needed for the free list as for a bitmap. For a concrete example, consider a 2-gigabyte disk with 8K blocks and 4-byte block numbers. The disk contains 2^31/2^13 = 2^18 = 262,144 blocks. If they are all free, the free list has 262,144 entries, so it would take one megabyte of memory to keep them all in memory at once. By contrast, a bitmap requires 2^18 bits, or 2^15 = 32K bytes (just four blocks). (On the other hand, the bitmap takes the same amount of memory regardless of the number of blocks that are free.)
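
The kind of search described above might look like the following sketch. The convention that a 1 bit means the block is free, and the allocNear routine itself, are assumptions made for this example rather than the allocator of any real system.

    class FreeMap {
        // bitmap[i] covers blocks 8*i .. 8*i+7; a set bit means the block is free.
        static boolean isFree(byte[] bitmap, int block) {
            return (bitmap[block / 8] & (1 << (block % 8))) != 0;
        }
        static void markAllocated(byte[] bitmap, int block) {
            bitmap[block / 8] &= ~(1 << (block % 8));
        }
        // Find a free block as close as possible to 'goal' and allocate it.
        // Returns the block number, or -1 if there are no free blocks at all.
        static int allocNear(byte[] bitmap, int goal) {
            int nBlocks = bitmap.length * 8;
            for (int dist = 0; dist < nBlocks; dist++) {
                int after = goal + dist, before = goal - dist;
                if (after < nBlocks && isFree(bitmap, after)) {
                    markAllocated(bitmap, after);
                    return after;
                }
                if (before >= 0 && isFree(bitmap, before)) {
                    markAllocated(bitmap, before);
                    return before;
                }
            }
            return -1;
        }
    }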

Reliability

Disks fail, disk sectors get corrupted, and systems crash, losing the contents of volatile memory. There are several techniques that can be used to mitigate the effects of these failures. We only have room for a brief survey.

Bad-block Forwarding

When the disk drive writes a block of data, it also writes a checksum, a small number of additional bits whose value is some function of the “user data” in the block. When the block is read back in, the checksum is recomputed from the data and compared with the stored checksum. If either the data or the checksum was corrupted, it is extremely unlikely that the comparison will succeed. Thus the disk drive itself has a way of discovering bad blocks with extremely high probability.

The hardware is also responsible for recovering from bad blocks. Modern disk drives do automatic bad-block forwarding. The disk drive or controller is responsible for mapping block numbers to absolute locations on the disk (cylinder, track, and sector). It holds a little bit of space in reserve, not mapping any block numbers to this space. When a bad block is discovered, the disk allocates one of these reserved blocks and maps the block number of the bad block to the replacement block. All references to this block number access the replacement block instead of the bad block. There are two problems with this scheme. First, when a block goes bad, the data in it is lost. In practice, blocks tend to be bad from the beginning, because of small defects in the surface coating of the disk platters. There is usually a stand-alone formatting program that tests all the blocks on the disk and sets up forwarding entries for those that fail. Thus the bad blocks never get used in the first place. The main reason for the forwarding is that it is just too hard (expensive) to create a disk with no defects. It is much more economical to manufacture a “pretty good” disk and then use bad-block forwarding to work around the few bad blocks. The other problem is that forwarding interferes with the OS's attempts to lay out files optimally. The OS may think it is doing a good job by assigning consecutive blocks of a file to consecutive block numbers, but if one of those blocks is forwarded, it may be very far away from the others. In practice, this is not much of a problem since a disk typically has only a handful of forwarded sectors out of millions.
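
The bookkeeping involved is essentially a small translation table maintained by the drive or controller firmware. The sketch below only illustrates the idea, with invented names; it is not a description of any real drive's implementation.

    class BadBlockMap {
        private final int[] badBlock;    // badBlock[i] is forwarded to spareStart + i
        private int nBad = 0;
        private final int spareStart;    // first block number of the reserved area

        BadBlockMap(int spareStart, int spareCount) {
            this.spareStart = spareStart;
            this.badBlock = new int[spareCount];
        }
        // Record that 'block' is bad; future accesses go to a spare block instead.
        void markBad(int block) {
            if (nBad == badBlock.length)
                throw new IllegalStateException("out of spare blocks");
            badBlock[nBad++] = block;
        }
        // Translate the block number the OS asked for into the block actually used.
        int translate(int block) {
            for (int i = 0; i < nBad; i++)
                if (badBlock[i] == block)
                    return spareStart + i;
            return block;
        }
    }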

The software can also help avoid bad blocks by simply leaving them out of the free list (or marking them as allocated in the allocation bitmap).

Back-up Dumps

[ Silberschatz, Galvin, and Gagne, Section 11.10.2 ]

There are a variety of storage media that are much cheaper than (hard) disks but are also much slower. An example is 8-millimeter video tape. A “two-hour” tape costs just a few dollars and can hold two gigabytes of data. By contrast, a 2GB hard drive currently costs several hundred dollars. On the other hand, while worst-case access time to a hard drive is a few tens of milliseconds, rewinding or fast-forwarding a tape to the desired location can take several minutes. One way to use tapes is to make periodic backup dumps. Dumps are really used for two different purposes: to recover the entire contents of a disk after a catastrophe (such as a disk crash or a fire), and to recover individual files that a user has accidentally deleted or corrupted.

Corresponding to these two ways of using dumps, there are two ways of doing dumps. A physical dump simply copies all of the blocks of the disk, in order, to tape. It's very fast, both for doing the dump and for recovering a whole disk, but it makes it extremely slow to recover any one file. The blocks of the file are likely to be scattered all over the tape, and while seeks on disk can take tens of milliseconds, seeks on tape can take tens or hundreds of seconds. The other approach is a logical dump, which copies each file sequentially. A logical dump makes it easy to restore individual files. It is even easier to restore files if the directories are dumped separately at the beginning of the tape, or if the name(s) of each file are written to the tape along with the file.

The problem with logical dumping is that it is very slow. Dumps are usually done much more frequently than restores. For example, you might dump your disk every night for three years before something goes wrong and you need to do a restore. An important trick that can be used with logical dumps is to dump only files that have changed recently. An incremental dump saves only those files that have been modified since a particular date and time. Fortunately, most file systems record the time each file was last modified. If you do a backup each night, you can save only those files that have changed since the last backup. Every once in a while (say once a month), you can do a full backup of all files. In Unix jargon, a full backup is called an epoch (pronounced “eepock”) dump, because it dumps everything that has changed since “the epoch”--January 1, 1970, which is the earliest possible date in Unix.3

The Computer Sciences department currently does backup dumps on about 260 GB of disk space. Epoch dumps are done once every 14 days, with the timing on different file systems staggered so that about 1/14 of the data is dumped each night. Daily incremental dumps save about 6-10% of the data on each file system.

Incremental dumps go fast because they dump only a small fraction of the files, and they don't take up a lot of tape. However, they introduce new problems: to restore a single file you have to figure out which tape holds its most recent copy, a full restore of the disk requires restoring from the epoch dump and then every incremental dump made since, and you have to keep all of those tapes around.

The first problem can be solved by keeping a directory of what was dumped when. A bunch of UW alumni (the same guys that invented NFS) have made themselves millionaires by marketing software to do this. The other problems can be solved by a clever trick. Each dump is assigned a positive integer level. A level n dump is an incremental dump that dumps all files that have changed since the most recent previous dump with a level greater than or equal to n. An epoch dump is considered to have infinitely high level. Levels are assigned to dumps as follows: dump number n gets a level one greater than the number of times n is evenly divisible by two, so the sequence of levels for successive dumps is 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1, 5, and so on.
This scheme is sometimes called a ruler schedule for obvious reasons. Level-1 dumps only save files that have changed in the previous day. Level-2 dumps save files that have changed in the last two days, level-3 dumps cover four days, level-4 dumps cover 8 days, etc. Higher-level dumps will thus include more files (so they will take longer to do), but they are done infrequently. The nice thing about this scheme is that you only need to save one tape from each level, and the number of levels is the logarithm of the interval between epoch dumps. Thus even if you did a dump each night and did an epoch dump only once a year, you would need only nine levels (hence nine tapes). That also means that a full restore needs at worst one restore from each of nine tapes (rather than 365 tapes!). To figure out what tapes you need to restore from if your disk is destroyed after dump number n, express n in binary, and number the bits from right to left, starting with 1. The 1 bits tell you which dump tapes to use. Restore them in order of decreasing level. For example, 20 in binary is 10100, so if the disk is destroyed after the 20th dump, you only need to restore from the epoch dump and from the most recent dumps at levels 5 and 3.
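
Both the level rule and the restore rule are easy to express in code. The following sketch simply restates the arithmetic above, with dumps numbered 1, 2, 3, ... starting after an epoch dump.

    class RulerSchedule {
        // Level of dump number n (n >= 1): one more than the number of
        // trailing zero bits, so the sequence is 1, 2, 1, 3, 1, 2, 1, 4, ...
        static int level(int n) {
            int level = 1;
            while (n % 2 == 0) {
                level++;
                n /= 2;
            }
            return level;
        }
        // Print which tapes are needed for a full restore after dump n:
        // the epoch tape plus, for each 1 bit in n, the most recent dump
        // at that level, restored in order of decreasing level.
        static void restorePlan(int n) {
            System.out.println("restore the epoch dump");
            for (int bit = 31; bit >= 1; bit--) {
                if ((n & (1 << (bit - 1))) != 0)
                    System.out.println("then the most recent level-" + bit + " dump");
            }
        }
        public static void main(String[] args) {
            for (int day = 1; day <= 16; day++)
                System.out.print(level(day) + " ");  // 1 2 1 3 1 2 1 4 ...
            System.out.println();
            restorePlan(20);                         // epoch, then level 5, then level 3
        }
    }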

Consistency Checking

[ Silberschatz, Galvin, and Gagne, Section 11.10.1 ]

Some of the information in a file system is redundant. For example, the free list could be reconstructed by checking which blocks are not in any file. Redundancy arises because the same information is represented in different forms to make different operations faster. If you want to know which blocks are in a given file, look at the inode. If you want to know which blocks are not in any inode, use the free list. Unfortunately, various hardware and software errors can cause the data to become inconsistent. File systems often include a utility that checks for consistency and optionally attempts to repair inconsistencies. These programs are particularly handy for cleaning up the disks after a crash.

Unix has a utility called fsck (short for “file system check”). It has two principal tasks. First, it checks that blocks are properly allocated. Each inode is supposed to be the root of a tree of blocks, the free list is supposed to be a tree of blocks, and each block is supposed to appear in exactly one of these trees. Fsck runs through all the inodes, checking each allocated inode for reasonable values, and walking through the tree of blocks rooted at the inode. It maintains a bit vector to record which blocks have been encountered. If a block is encountered that has already been seen, there is a problem: either it occurred twice in the same file (in which case it isn't a tree), or it occurred in two different files. A reasonable recovery would be to allocate a new block, copy the contents of the problem block into it, and substitute the copy for the problem block in one of the two places where it occurs. It would also be a good idea to log an error message so that a human being can check up later to see what's wrong. After all the files are scanned, any block that hasn't been found should be on the free list. It would be possible to scan the free list in a similar manner, but it's probably easier just to rebuild the free list from the set of blocks that were not found in any file. If a bitmap instead of a free list is used, this step is even easier: simply overwrite the file system's bitmap with the bitmap constructed during the scan.
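
A stripped-down sketch of this first pass appears below: walk every allocated inode's tree of blocks, record each block in a bit vector, and complain about duplicates. The FS interface is an invented placeholder standing in for the real on-disk structures; the real fsck is, of course, far more thorough.

    import java.util.BitSet;

    class BlockCheck {
        // Hypothetical interface to the file system being checked.
        interface FS {
            int blockCount();                  // total blocks on the disk
            int inodeCount();
            boolean inodeAllocated(int inum);
            int[] blocksOf(int inum);          // all blocks reachable from the inode
        }
        // Returns the set of blocks found in some file; prints a message for
        // any block that is claimed more than once.
        static BitSet scan(FS fs) {
            BitSet seen = new BitSet(fs.blockCount());
            for (int inum = 1; inum <= fs.inodeCount(); inum++) {
                if (!fs.inodeAllocated(inum))
                    continue;
                for (int block : fs.blocksOf(inum)) {
                    if (seen.get(block))
                        System.out.println("block " + block + " multiply claimed"
                                + " (seen again in inode " + inum + ")");
                    seen.set(block);
                }
            }
            return seen;                       // anything not set here should be free
        }
    }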

The other main consistency requirement concerns the directory structure. The set of directories is supposed to form a tree, and each inode is supposed to have a link count that indicates how many times it appears in directories. The tree structure could be checked by a recursive walk through the directories, but it is more efficient to combine this check with the walk through the inodes that checks disk blocks, recording, for each directory inode encountered, the inumber of its parent. The set of directories is a tree if and only if every directory other than the root has a unique parent. This pass can also rebuild the link count for each inode by maintaining in memory an array with one slot for each inumber. Each time the inumber is found in a directory, increment the corresponding element of the array. The resulting counts should match the link counts in the inodes. If not, correct the counts in the inodes.

This illustrates a very important principle that pops up throughout operating system implementation (indeed, throughout any large software system): the doctrine of hints and absolutes. Whenever the same fact is recorded in two different ways, one of them should be considered the absolute truth, and the other should be considered a hint. Hints are handy because they allow some operations to be done much more quickly than they could be if only the absolute information were available. But if the hint and the absolute do not agree, the hint can be rebuilt from the absolutes. In a well-engineered system, there should be some way to verify a hint whenever it is used. Unix is a bit lax about this. The link count is a hint (the absolute information is a count of the number of times the inumber appears in directories), but Unix treats it like an absolute during normal operation. As a result, a small error can snowball into completely trashing the file system.

For another example of hints, each allocated block could have a header containing the inumber of the file containing it and its offset in the file. There are systems that do this (Unix isn't one of them). The tree of blocks rooted at an inode then becomes a hint, providing an efficient way of finding a block, but when the block is found, its header could be checked. Any inconsistency would then be caught immediately, and the inode structures could be rebuilt from the information in the block headers.

By the way, if the link count calculated by the scan is zero (i.e., the inode, although marked as allocated, does not appear in any directory), it would not be prudent to delete the file. A better recovery is to add an entry to a special lost+found directory pointing to the orphan inode, in case it contains something really valuable.

Transactions

The previous section talks about how to recover from situations that “can't happen.” How do these problems arise in the first place? Wouldn't it be better to prevent these problems rather than recover from them after the fact? Many of these problems arise, particularly after a crash, because some operation was “half-completed.” For example, suppose the system was in the middle of executing an unlink system call when the lights went out. An unlink operation involves several distinct steps: remove the entry from the directory, decrement the link count in the file's inode, and, if the link count is now zero, return the file's blocks to the free list and free the inode.

If the crash occurs between the first and second steps, the link count will be wrong. If it occurs during the third step, a block may be linked both into the file and the free list, or neither, depending on the details of how the code is written. And so on...

To deal with this kind of problem in a general way, transactions were invented. Transactions were first developed in the context of database management systems, and are used heavily there, so there is a tradition of thinking of them as “database stuff” and teaching about them only in database courses and text books. But they really are an operating system concept. Here's a two-bit introduction.

We have already seen a mechanism for making complex operations appear atomic. It is called a critical section. Critical sections have a property that is sometimes called synchronization atomicity. It is also called serializability because if two processes try to execute their critical sections at about the same time, the net effect will be as if they occurred in some serial order.4 If systems can crash (and they can!), synchronization atomicity isn't enough. We need another property, called failure atomicity, which means an “all or nothing” property: either all of the modifications of nonvolatile storage complete or none of them do.

There are basically two ways to implement failure atomicity. They both depend on the fact that writing a single block to disk is an atomic operation. The first approach is called logging. An append-only file called a log is maintained on disk. Each time a transaction does something to file-system data, it creates a log record describing the operation and appends it to the log. The log record contains enough information to undo the operation. For example, if the operation made a change to a disk block, the log record might contain the block number, the length and offset of the modified part of the block, and the original content of that region. The transaction also writes a begin record when it starts, and a commit record when it is done. After a crash, a recovery process scans the log looking for transactions that started (wrote a begin record) but never finished (wrote a commit record). If such a transaction is found, its partially completed operations are undone (in reverse order) using the undo information in the log records.
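
As a sketch of the idea (not any particular system's log format), here is what an undo-only log record and the corresponding recovery step might look like, assuming each record captures the original contents of the region it changed and that the ids of committed transactions have been collected from the commit records in the log.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Set;

    class UndoLog {
        // One log record: enough information to undo a change to one disk block.
        static class Record {
            final long txn;          // which transaction made the change
            final int block;         // which disk block was modified
            final int offset;        // where in the block the change starts
            final byte[] oldBytes;   // the original contents of that region
            Record(long txn, int block, int offset, byte[] oldBytes) {
                this.txn = txn; this.block = block;
                this.offset = offset; this.oldBytes = oldBytes.clone();
            }
        }
        // Hypothetical block-device interface used during recovery.
        interface Device {
            void write(int block, int offset, byte[] data);
        }
        // Undo, in reverse order, every record belonging to a transaction that
        // never wrote a commit record ('committed' holds the ids that did).
        static void recover(Iterable<Record> log, Set<Long> committed, Device disk) {
            Deque<Record> toUndo = new ArrayDeque<>();
            for (Record r : log)
                if (!committed.contains(r.txn))
                    toUndo.push(r);              // most recent record ends up on top
            while (!toUndo.isEmpty()) {
                Record r = toUndo.pop();
                disk.write(r.block, r.offset, r.oldBytes);
            }
        }
    }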

Sometimes, for efficiency, disk data is cached in memory. Modifications are made to the cached copy and only written back out to disk from time to time. If the system crashes before the changes are written to disk, the data structures on disk may be inconsistent. Logging can also be used to avoid this problem by putting into each log record redo information as well as undo information. For example, the log record for a modification of a disk block should contain both the old and new value. After a crash, if the recovery process discovers a transaction that has completed, it uses the redo information to make sure the effects of all of its operations are reflected on disk. Full recovery is always possible provided the log record describing a change reaches the disk before the changed block itself does, and all of a transaction's log records (including its commit record) are safely on disk before the transaction is reported as complete.

This algorithm is called write-ahead logging.

The other way of implementing transactions is called shadow blocks.5 Suppose the data structure on disk is a tree. The basic idea is never to change any block (disk block) of the data structure in place. Whenever you want to modify a block, make a copy of it (called a shadow of it) instead, and modify the parent to point to the shadow. Of course, to make the parent point to the shadow you have to modify it, so instead you make a shadow of the parent and modify that instead. In this way, you shadow not only each block you really wanted to modify, but also all the blocks on the path from it to the root. You keep the shadow of the root block in memory. At the end of the transaction, you make sure the shadow blocks are all safely written to disk and then write the shadow of the root directly onto the root block. If the system crashes before you overwrite the root block, there will be no permanent change to the tree on disk. Overwriting the root block has the effect of linking all the modified (shadow) blocks into the tree and removing all the old blocks. Crash recovery is simply a matter of garbage collection. If the crash occurs before the root was overwritten, all the shadow blocks are garbage. If it occurs after, the blocks they replaced are garbage. In either case, the tree itself is consistent, and it is easy to find the garbage blocks (they are the blocks that aren't in the tree).
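
The same copy-on-write idea can be illustrated with an ordinary in-memory binary tree: updating a value copies only the nodes on the path from the change to the root, and the old root continues to describe the old, consistent version. On disk, the copies would be newly allocated blocks, and installing the new root would be the single atomic block write described above. This is only a toy illustration, not the code of any real file system.

    class ShadowTree {
        static class Node {
            final int key, value;
            final Node left, right;
            Node(int key, int value, Node left, Node right) {
                this.key = key; this.value = value;
                this.left = left; this.right = right;
            }
        }
        // Return a NEW root in which 'key' maps to 'value'; the old root (and
        // every node not on the path to the change) is shared, never modified.
        static Node update(Node root, int key, int value) {
            if (root == null)
                return new Node(key, value, null, null);
            if (key < root.key)
                return new Node(root.key, root.value,
                                update(root.left, key, value), root.right);
            if (key > root.key)
                return new Node(root.key, root.value,
                                root.left, update(root.right, key, value));
            return new Node(key, value, root.left, root.right);  // shadow of this node
        }
        public static void main(String[] args) {
            Node v1 = update(update(null, 5, 50), 3, 30);
            Node v2 = update(v1, 3, 31);   // only the path to key 3 was copied; v1 is unchanged
            System.out.println(v1.left.value + " vs " + v2.left.value);  // prints 30 vs 31
        }
    }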

Database systems almost universally use logging, and shadowing is mentioned only in passing in database texts. But the shadowing technique is used in a variant of the Unix file system called (somewhat misleadingly) the Log-structured File System (LFS). The entire file system is made into a tree by replacing the array of inodes with a tree of inodes. LFS has the added advantage (beyond reliability) that all blocks are written sequentially, so write operations are very fast. It has the disadvantage that files that are modified here and there by random access tend to have their blocks scattered about, but that pattern of access is comparatively rare, and there are techniques to cope with it when it occurs. The main source of complexity in LFS is figuring out when and how to do the “garbage collection.”

Performance

[ Silberschatz, Galvin, and Gagne, Section 11.9 ]

The main trick to improve file system performance (like anything else in computer science) is caching. The system keeps a disk cache (sometimes also called a buffer pool) of recently used disk blocks. In contrast with the page frames of virtual memory, where all sorts of algorithms were proposed for managing the cache, management of the disk cache is pretty simple. On the whole, it is simply managed LRU (least recently used). Why is it that for paging we went to great lengths trying to come up with an algorithm that is “almost as good as LRU” while here we can simply use true LRU? The problem with implementing LRU is that some information has to be updated on every single reference. In the case of paging, references can be as frequent as every instruction, so we have to make do with whatever information the hardware is willing to give us. The best we can hope for is that the paging hardware will set a bit in a page-table entry. In the case of file system disk blocks, however, each reference is the result of a system call, and adding a few extra instructions to a system call for cache maintenance is not unreasonable.

Adding page caching to the file system implementation is actually quite simple. Somewhere in the implementation, there is probably a procedure that gets called when the system wants to access a disk block. Let's suppose the procedure simply allocates some memory space to hold the block and reads it into memory.


    Block readBlock(int blockNumber) {
        Block result = new Block();
        Disk.read(blockNumber, result);
        return result;
    }
To add caching, all we have to do is modify this code to search the disk cache first.

    class CacheEntry {
        int blockNumber;
        Block buffer;
        CacheEntry next, previous;
    }
    class DiskCache {
        CacheEntry head, tail;
        CacheEntry find(int blockNumber) {
            // Search the list for an entry with a matching block number.
            // If not found, return null.
            for (CacheEntry entry = head; entry != null; entry = entry.next) {
                if (entry.blockNumber == blockNumber)
                    return entry;
            }
            return null;
        }
        void moveToFront(CacheEntry entry) {
            // Move entry to the head of the list.
            if (entry == head)
                return;
            // Unlink the entry from its current position ...
            if (entry.previous != null)
                entry.previous.next = entry.next;
            if (entry.next != null)
                entry.next.previous = entry.previous;
            if (entry == tail)
                tail = entry.previous;
            // ... and relink it at the head.
            entry.previous = null;
            entry.next = head;
            if (head != null)
                head.previous = entry;
            head = entry;
            if (tail == null)
                tail = entry;
        }
        CacheEntry oldest() {
            return tail;
        }
        Block readBlock(int blockNumber) {
            CacheEntry entry = find(blockNumber);
            if (entry == null) {
                // Not in the cache: reuse the least recently used entry.
                entry = oldest();
                Disk.read(blockNumber, entry.buffer);
                entry.blockNumber = blockNumber;
            }
            moveToFront(entry);
            return entry.buffer;
        }
    }
This code is not quite right, because it ignores writes. If the oldest buffer is dirty (it has been modified since it was read from disk), it first has to be written back to the disk before it can be used to hold the new block. Most systems actually write dirty buffers back to the disk sooner than necessary to minimize the damage caused by a crash. The original version of Unix had a background process that would write all dirty buffers to disk every 30 seconds. Some information is more critical than others. Some versions of Unix, for example, write directory blocks (the data blocks of files of type directory) back to disk each time they are modified. This technique--keeping the block in the cache but writing its contents back to disk after any modification--is called write-through caching. (Some modern versions of Unix use techniques inspired by database transactions to minimize the effects of crashes.)
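
To show where this fits into the sketch above, here is one way the cache might handle writes. It assumes a boolean dirty field is added to CacheEntry and that the Disk class has a write operation analogous to its read; both are assumptions made for this sketch.

    // Additions to DiskCache, assuming CacheEntry gains a 'boolean dirty' field
    // and Disk has a write() analogous to read() (both assumed for this sketch).
    void flush(CacheEntry entry) {
        // Write the buffer back to disk if it has been modified.
        if (entry.dirty) {
            Disk.write(entry.blockNumber, entry.buffer);
            entry.dirty = false;
        }
    }
    void writeBlock(int blockNumber, Block data) {
        CacheEntry entry = find(blockNumber);
        if (entry == null) {
            entry = oldest();
            flush(entry);                // save the evicted block's contents first
            entry.blockNumber = blockNumber;
        }
        entry.buffer = data;             // modify only the cached copy ...
        entry.dirty = true;              // ... and remember it must be written back
        moveToFront(entry);
    }
    Block readBlock(int blockNumber) {   // as before, but flush before reusing a buffer
        CacheEntry entry = find(blockNumber);
        if (entry == null) {
            entry = oldest();
            flush(entry);
            Disk.read(blockNumber, entry.buffer);
            entry.blockNumber = blockNumber;
        }
        moveToFront(entry);
        return entry.buffer;
    }
A background process like the one mentioned above would simply walk the list periodically and call flush on every entry.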

LRU management automatically does the “right thing” for most disk blocks. If someone is actively manipulating the files in a directory, all of the directory's blocks will probably be in the cache. If a process is scanning a large file, all of its indirect blocks will probably be in memory most of the time. But there is one important case where LRU is not the right policy. Consider a process that is traversing (reading or writing) a file sequentially from beginning to end. Once that process has read or written the last byte of a block, it will not touch that block again. The system might as well immediately move the block to the tail of the list as soon as the read or write request completes. Tanenbaum calls this technique free behind. It is also sometimes called most recently used (MRU) to contrast it with LRU. How does the system know to handle certain blocks MRU? There are several possibilities: the application could tell the file system (through a system call or a flag supplied when it opens the file) that it intends to read the file sequentially, or the file system could notice on its own that a file is being accessed sequentially and switch that file's blocks to MRU treatment.

A similar trick is called read-ahead. If a file is being read sequentially, it is a good idea to read a few blocks at a time. This cuts down on the latency for the application (most of the time the data the application wants is in memory before it even asks for it). If the disk hardware allows multiple blocks to be read at a time, it can cut the number of disk read requests, cutting down on overhead such as the time to service an I/O completion interrupt. If the system has done a good job of clustering together the blocks of the file, read-ahead also takes better advantage of the clustering. If the system reads one block at a time, another process, accessing a different file, could make the disk head move away from the area containing the blocks of this file between accesses.

The Berkeley file system introduced another trick to improve file system performance. They divided the disk into chunks, which they called cylinder groups (CGs) because each one consists of some number of adjacent cylinders. Each CG is like a miniature disk. It has its own super block and array of inodes. The system attempts to put all the blocks of a file in the same CG as its inode. It also tries to keep all the inodes in one directory together in the same CG so that operations like


    ls -l *
will be fast. It uses a variety of techniques to assign inodes and blocks to CGs in such a way as to distribute the free space fairly evenly between them, so there will be enough room to do this clustering. In particular, a new directory is placed in a CG with a greater-than-average amount of free space, and the blocks of a file that grows very large are deliberately spread across several CGs, so that one big file cannot use up all the space near its inode.

1This Java declaration is actually a bit of a lie. In Java, an instance of class Dirent would include some header information indicating that it was a Dirent object, a two-byte short integer, and a pointer to an array object (which contains information about its type and length, in addition to the 14 bytes of data). The actual representation is given by the C (or C++) declaration


    struct direct {
        unsigned short int inumber;
        char name[14];
    };
Unfortunately, there's no way to represent this in Java.

2This is also a lie, for the reasons cited in the previous footnote, and also because the field byte name[] is intended to indicate an array of indeterminate length, rather than a pointer to an array. The actual C declaration is


    struct dirent {
        unsigned long int inumber;
        unsigned short int reclen;
        unsigned short int namelen;
        char name[256];
    };
The array size 256 is a lie. The code depends on the fact that the C language does not do any array bounds checking.

3The dictionary defines epoch as

      1 : an instant of time or a date selected as a point of 
          reference in astronomy
      2  a : an event or a time marked by an event that begins a new 
             period or development
         b : a memorable event or date

4Critical sections are usually implemented so that they actually occur one after the other, but all that is required is that they behave as if they were serialized. For example, if neither transaction modifies anything, or if they don't touch any overlapping data, they can be run concurrently without any harm. Database implementations of transactions go to a great deal of trouble to allow as much concurrency as possible.

5Actually, the technique is usually called “shadow paging” because in the context of databases, disk blocks are often called “pages.” We reserve the term “pages” for virtual memory.


Previous File Systems
Next Security
Contents
solomon@cs.wisc.edu
Tue Jan 16 14:33:41 CST 2007

Copyright © 1997-2007 by Marvin Solomon. All rights reserved.