** The Fast File System **

When UNIX was first introduced, the UNIX wizard himself, Ken Thompson, wrote the first file system. We will call that the "old UNIX file system", and it was really simple. Basically, it looked like this on the disk:

Super block | Inodes | Data blocks

The super block contained information about the entire file system: how big the volume is, how many inodes there are, a pointer to the head of the free list of blocks, and so forth. The inode region of the disk contained all the inodes for the file system. Finally, most of the disk was taken up by data blocks.

The good thing about the old file system was that it was simple, and supported the basic abstractions the file system was trying to deliver: files and the directory hierarchy.

[THE CRUX OF THE PROBLEM]

The problem: performance was terrible. As measured by Bill Joy and his colleagues at Berkeley [1], performance started off bad and got worse over time, to the point where the file system was delivering only 2% of overall disk bandwidth!

The main issue was that the old UNIX file system treated the disk like it was random-access memory; data was spread all over the place without regard to the fact that the medium holding the data was a disk, and thus had real and expensive positioning costs. For example, the data blocks of a file were often very far away from its inode, thus inducing an expensive seek whenever one first read the inode and then the data blocks of a file (a pretty common operation).

Worse, the file system would end up getting quite *fragmented*, as the free space was not carefully managed. The free list would end up pointing to a bunch of blocks spread across the disk, and as files got allocated, they would simply take the next free block. The result was that a logically contiguous file would be accessed by going back and forth across the disk, thus reducing performance dramatically.

For example, imagine the following data block region, which contains four files (A, B, C, and D), each of size 2 blocks:

A1 A2 B1 B2 C1 C2 D1 D2

Now, if B and D are deleted, we will have:

A1 A2 free free C1 C2 free free

As you can see, the free space is fragmented into two chunks of two blocks, instead of one nice contiguous chunk of four. Let's say we now wish to allocate a file E, of size 4 blocks:

A1 A2 E1 E2 C1 C2 E3 E4

You can see what happens: E gets spread across the disk, and as a result, when accessing E, you don't get peak (sequential) performance from the disk. Rather, you first read E1 and E2, then seek, then read E3 and E4. This fragmentation problem happened all the time in the old UNIX file system, and it hurt performance. (A side note: this problem is exactly what disk defragmentation tools help with; they reorganize on-disk data to place files contiguously and to make free space one or a few contiguous regions, moving data around and then rewriting inodes and such to reflect the changes.) A small code sketch of this naive allocation appears just after the FFS introduction below.

One other problem: the original block size was too small (512 bytes). Thus, transferring data from the disk was inherently inefficient. Smaller blocks were good because they minimized *internal fragmentation* (waste within the block), but bad for transfer, as each block might require a positioning overhead to reach it.

[FFS: A SOLUTION]

A group at Berkeley decided to build a better, faster file system, which they cleverly called the "Fast File System" (FFS). The idea was to design the file system structures and allocation policies to be "disk aware" and thus improve performance, which is exactly what they did.
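To make the fragmentation example above concrete, here is a minimal sketch (purely illustrative; it is not code from any real file system) of the naive "take the next free block" policy, using the same eight-block region and the new file E from the example:

#include <stdio.h>

#define NBLOCKS 8

/* The data region after deleting B and D: '.' marks a free block. */
static char region[NBLOCKS] = { 'A', 'A', '.', '.', 'C', 'C', '.', '.' };

/* Naive allocation: scan for free blocks in order and hand them out,
 * with no attempt to keep a file's blocks contiguous. */
static void allocate(char name, int nblocks)
{
    for (int i = 0; i < NBLOCKS && nblocks > 0; i++) {
        if (region[i] == '.') {
            region[i] = name;
            nblocks--;
        }
    }
}

int main(void)
{
    allocate('E', 4);   /* create the new 4-block file E */
    for (int i = 0; i < NBLOCKS; i++)
        printf("%c%s", region[i], i + 1 < NBLOCKS ? " " : "\n");
    /* prints: A A E E C C E E -- E ends up split into two far-apart pieces */
    return 0;
}

A real free list is a linked structure on disk rather than a little array, but the effect is the same: the logically contiguous file E ends up in two pieces separated by other data, forcing a seek in the middle of what should be a sequential read.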
[STRUCTURES: THE CYLINDER GROUP]

The first step was to change the on-disk structures. FFS divides the disk into a bunch of groups known as *cylinder groups* (some modern file systems, like Linux ext2 and ext3, just call them *block groups*). We can thus imagine a disk with 8 cylinder groups as follows:

Group0 | Group1 | Group2 | Group3 | Group4 | Group5 | Group6 | Group7

Each group looks like this:

Super Block | Inode Bitmap | Data Bitmap | Inodes | Data blocks

We now describe the components of a cylinder group. A copy of the super block is found in each group for reliability reasons (e.g., if one gets corrupted or scratched, you can still mount and access the file system by using one of the others). The inode and data bitmaps track whether each inode or data block is free, respectively. Bitmaps are an excellent way to manage free space in a file system because it is easy to find a large chunk of free space and allocate it to a file, perhaps avoiding some of the fragmentation problems of the free list in the old UNIX file system. Finally, the inode and data block regions are just like in the previous file system.

[POLICIES: ALLOCATING DIRECTORIES, FILES]

With this group structure in place, FFS now has to decide how to place files and directories and associated metadata on disk to improve performance. The basic mantra is simple: *keep related stuff together* (and its corollary, keep unrelated stuff far apart). Thus, to obey the mantra, FFS has to decide what is "related" and place it within the same block group; conversely, unrelated items should be placed into different block groups. To achieve this end, FFS makes use of a few simple placement heuristics.

The first is the placement of directories. FFS employs a simple approach: find the cylinder group with a low number of allocated directories (because we want to balance directories across groups) and a high number of free inodes (because we want to subsequently be able to allocate a bunch of files), and put the directory data and inode in that group. Of course, other heuristics could be used here (e.g., taking into account the number of free data blocks); a rough code sketch of this directory-placement approach appears at the end of this section.

For files, FFS does two things. First, it makes sure (in the general case) to allocate the data blocks of a file in the same group as its inode, thus preventing long seeks between inode and data (as in the old file system). Second, it places all files that are in the same directory in the cylinder group of the directory they are in. Thus, if a user creates four files, "/dir1/1.txt", "/dir1/2.txt", "/dir1/3.txt", and "/dir99/4.txt", FFS would try to place the first three near one another (same group) and the fourth far away (in some other group).

It should be noted that these heuristics are not based on extensive studies of file system traffic or anything particularly nuanced; rather, they are based on good old-fashioned common sense (isn't that what CS stands for, after all?). Files in a directory *are* often accessed together (imagine compiling a bunch of files and then linking them into a single executable). Because they are, FFS will often improve performance, making sure that seeks between related files are short.
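As a rough sketch of the directory-placement heuristic described above (the structures and the scoring function here are invented for illustration; the real FFS code differs in its details), one might choose a cylinder group for a new directory like this:

#include <limits.h>

/* Hypothetical per-group summary information; real FFS keeps similar
 * counts in each cylinder group's on-disk header. */
struct group_summary {
    int num_dirs;       /* directories already allocated in this group */
    int free_inodes;    /* inodes still free in this group */
};

/* Pick a cylinder group for a new directory: prefer groups with few
 * existing directories and many free inodes, per the FFS heuristic. */
int pick_group_for_dir(struct group_summary *groups, int ngroups)
{
    int best = -1;
    long best_score = LONG_MIN;
    for (int i = 0; i < ngroups; i++) {
        /* One simple (made-up) scoring function: reward free inodes,
         * penalize groups that already hold many directories. */
        long score = (long)groups[i].free_inodes - 100L * groups[i].num_dirs;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}

A new file's inode is then simply placed in the same group as its parent directory, and its data blocks in the same group as its inode, which is how the "keep related stuff together" mantra gets carried out.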
[THE LARGE-FILE EXCEPTION]

There is one important exception to the general policy of file placement, and it happens for large files. Without a different rule, a large file would entirely fill its first block group (and maybe others). Filling a block group in this manner is undesirable for the following reason: it prevents subsequent "related" files from being placed within this block group, and thus may hurt subsequent file-access locality.

Thus, for large files, FFS does the following. After some number of blocks are allocated into the first block group (e.g., 12 blocks), FFS places the next "large" chunk of the file in another block group (perhaps chosen for its low utilization). Then, the next chunk of the file is placed in yet another block group, and so on. Thus, instead of a single large file (with chunks 0 through 7) filling up a block group like this:

Group0   | Group1 | Group2 | Group3 | Group4 | Group5 | Group6 | Group7
01234567

we instead get the file spread across the disk in chunks:

Group0 | Group1 | Group2 | Group3 | Group4 | Group5 | Group6 | Group7
01                23                45                67

The astute reader will note that this might hurt performance, particularly in the relatively common case of sequential file access (e.g., when a user or application reads chunks 0 through 7 in order). And you are right! It will. But we can help this a little bit, by choosing our chunk size carefully. Specifically, if the chunk size is large enough, we will still spend most of our time transferring data from disk and only a relatively small amount of time seeking between chunks of the file. This process of reducing an overhead by doing more work per overhead paid is called *amortization* and is a common technique in computer systems.

Let's do an example: assume that the average seek time for a disk is 10 ms. Assume further that the disk transfers data at 40 MB/s. If our goal was to spend half our time seeking between chunks and half our time transferring data (and thus achieve 50% of peak disk performance), we would thus need to spend 10 ms transferring data for every 10 ms seek. So the question becomes: how big does a chunk have to be in order to spend 10 ms in transfer? Easy, just use our old friend, math:

   40 MB     1 second
  -------- * -------- * 10 ms = 409 KB
   second    1000 ms

(Hint: remember that a MB is 1024 KB, not 1000 KB; that is why you get about 409 KB and not exactly 400.)

Basically, what this equation says is this: if you transfer data at 40 MB/s, you need to transfer only about 409 KB every time you seek in order to spend half your time seeking and half your time transferring. Similarly, you can compute the size of the chunk you would need to achieve 90% of peak bandwidth (it turns out to be about 3.69 MB), or even 99% of peak bandwidth (40.6 MB!). As you can see, the closer you want to get to peak, the bigger these chunks get (much bigger, in fact!). And it gets worse over time: transfer rates improve quickly, while seek times improve only slowly. See if you can figure out why.
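The same arithmetic generalizes: to spend a fraction p of the total time transferring, each transfer must take p/(1-p) times as long as a seek. The little throwaway program below (using only the example parameters from above, a 10-ms average seek and a 40-MB/s transfer rate) reproduces the numbers in the text:

#include <stdio.h>

int main(void)
{
    double seek_ms = 10.0;                            /* average seek time */
    double rate_kb_per_ms = 40.0 * 1024.0 / 1000.0;   /* 40 MB/s, in KB per ms */
    double targets[] = { 0.50, 0.90, 0.99 };          /* fraction of peak bandwidth */

    for (int i = 0; i < 3; i++) {
        double p = targets[i];
        /* To spend fraction p of total time transferring, the transfer
         * must take p/(1-p) times as long as the seek. */
        double transfer_ms = seek_ms * p / (1.0 - p);
        double chunk_kb = rate_kb_per_ms * transfer_ms;
        printf("%2.0f%% of peak: transfer %6.0f ms per seek, chunk ~%8.1f KB\n",
               p * 100, transfer_ms, chunk_kb);
    }
    return 0;
}

Running it prints chunk sizes of roughly 409.6 KB, 3686.4 KB, and 40550.4 KB for the 50%, 90%, and 99% targets, respectively; the 3.69-MB and 40.6-MB figures quoted above correspond to these KB values divided by 1000.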
[A FEW OTHER THINGS ABOUT FFS]

FFS did a couple of other neat things too. In particular, the designers were also extremely worried about accommodating small files; as it turned out, many files were 2 KB or so in size back then, and using 4-KB blocks, while good for transferring data, was not so good for space efficiency. Internal fragmentation could thus lead to roughly half the disk being wasted for a typical file system.

The solution the FFS designers hit upon was simple and solved the problem quite nicely. They decided to introduce "subblocks", which were 512-byte little blocks that the file system could allocate to files. Thus, if you created a small file (say 1 KB in size), it would occupy two subblocks and thus not waste an entire 4-KB block. As the file grew, the file system would continue allocating 512-byte blocks to it until it accumulated a full 4 KB of data. At that point, FFS would find a 4-KB block, copy the subblocks into it, and free the subblocks for future use.

You might note that this is inefficient, requiring a lot of extra work from the file system. And you'd be right again! Thus, FFS would generally avoid this extra work for larger files by always writing to disk in multiples of 4 KB, thus avoiding the subblock specialization entirely.

A second neat thing FFS introduced was a disk layout that was optimized for performance. In those days (before SCSI), disks were much less sophisticated and required the host CPU to control their operation in a much more hands-on way. A problem arose in FFS when a file was placed on consecutive sectors of the disk, as in the following layout (blocks listed in the order they appear around the track):

0 1 2 3 4 5 6 7 8 9 a b

In particular, FFS would first issue a read to block 0. By the time the read was complete, and FFS was about to issue a read to block 1, it was too late: block 1 had already rotated under the head, and now the read to block 1 would incur a full rotation. FFS solved this problem with a different layout (again listed in the order the blocks appear around the track):

0 6 1 7 2 8 3 9 4 a 5 b

By skipping over every other block (in this example), FFS had enough time to request the next block before it rotated past the disk head. In fact, FFS was smart enough to figure out, for a particular disk, how many blocks it should skip when doing layout in order to avoid the extra rotations; this was called "parameterization", as FFS would figure out the disk's performance parameters and use those to decide on the exact staggered layout scheme.

Now you might be thinking: this isn't so great after all. In fact, you will only get 50% of peak bandwidth with this type of layout, because you have to go around each track twice just to read each block once. Because of this, modern disks are much smarter: they internally read the entire track in and buffer it in an internal disk cache (often called a "track buffer" for this very reason). Then, on subsequent reads to the track, the disk will just return the desired data from its cache. File systems thus no longer have to worry about these incredibly low-level details. Abstraction and higher-level interfaces can be a good thing!

[1] Marshall K. McKusick, William N. Joy, Sam J. Leffler, Robert S. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Volume 2, Number 3, August 1984, pages 181-197.