** The Fast File System **

When UNIX was first introduced, the UNIX wizard himself, Ken Thompson, wrote the first file system. We will call that the "old UNIX file system", and it was really simple. Basically, it looked like this on the disk:

Super block | Inodes | Data blocks

The super block contained information about the entire file system: how big the volume is, how many inodes there are, a pointer to the head of the free list of blocks, and so forth. The inode region of the disk contained all the inodes for the file system. Finally, most of the disk was taken up by data blocks.

The good thing about the old file system was that it was simple, and supported the basic abstractions the file system was trying to deliver: files and the directory hierarchy.

[THE CRUX OF THE PROBLEM]

The problem: performance was terrible. As measured by Bill Joy and his colleagues at Berkeley [1], performance started off bad and got worse over time, to the point where the file system was delivering only 2% of overall disk bandwidth!

The main issue was that the old UNIX file system treated the disk like it was random-access memory; data was spread all over the place without regard to the fact that the medium holding the data was a disk, and thus had real and expensive positioning costs. For example, the data blocks of a file were often very far away from its inode, thus inducing an expensive seek whenever one first read the inode and then the data blocks of a file (a pretty common operation).

Worse, the file system would end up getting quite *fragmented*, as the free space was not carefully managed. The free list would end up pointing to a bunch of blocks spread across the disk, and as files got allocated, they would simply take the next free block. The result was that a logically contiguous file would be accessed by going back and forth across the disk, thus reducing performance dramatically.

For example, imagine the following data block region, which contains four files (A, B, C, and D), each of size 2 blocks:

A1 A2 B1 B2 C1 C2 D1 D2

Now, if B and D are deleted, we will have:

A1 A2 free free C1 C2 free free

As you can see, the free space is fragmented into two chunks of two blocks, instead of one nice contiguous chunk of four. Let's say we now wish to allocate a file E, of size 4 blocks:

A1 A2 E1 E2 C1 C2 E3 E4

You can see what happens: E gets spread across the disk, and as a result, when accessing E, you don't get peak (sequential) performance from the disk. Rather, you first read E1 and E2, then seek, then read E3 and E4. This fragmentation problem happened all the time in the old UNIX file system, and it hurt performance. (A side note: this problem is exactly what disk defragmentation tools help with; they reorganize on-disk data to place files contiguously and to make free space one or a few contiguous regions, moving data around and then rewriting inodes and such to reflect the changes.) A small code sketch of this naive allocation appears just after the FFS introduction below.

One other problem: the original block size was too small (512 bytes). Thus, transferring data from the disk was inherently inefficient. Smaller blocks were good because they minimized *internal fragmentation* (waste within the block), but bad for transfer, as each block might require a positioning overhead to reach it.

[FFS: A SOLUTION]

A group at Berkeley decided to build a better, faster file system, which they cleverly called the "Fast File System" (FFS). The idea was to design the file system structures and allocation policies to be "disk aware" and thus improve performance, which is exactly what they did.
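To make the fragmentation example above concrete, here is a minimal sketch (purely illustrative; it is not code from any real file system) of the naive "take the next free block" policy, using the same eight-block region and the new file E from the example:

#include <stdio.h>

#define NBLOCKS 8

/* The data region after deleting B and D: '.' marks a free block. */
static char region[NBLOCKS] = { 'A', 'A', '.', '.', 'C', 'C', '.', '.' };

/* Naive allocation: scan for free blocks in order and hand them out,
 * with no attempt to keep a file's blocks contiguous. */
static void allocate(char name, int nblocks)
{
    for (int i = 0; i < NBLOCKS && nblocks > 0; i++) {
        if (region[i] == '.') {
            region[i] = name;
            nblocks--;
        }
    }
}

int main(void)
{
    allocate('E', 4);   /* create the new 4-block file E */
    for (int i = 0; i < NBLOCKS; i++)
        printf("%c%s", region[i], i + 1 < NBLOCKS ? " " : "\n");
    /* prints: A A E E C C E E -- E ends up split into two far-apart pieces */
    return 0;
}

A real free list is a linked structure on disk rather than a little array, but the effect is the same: the logically contiguous file E ends up in two pieces separated by other data, forcing a seek in the middle of what should be a sequential read.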
[STRUCTURES: THE CYLINDER GROUP]

The first step was to change the on-disk structures. FFS divides the disk into a bunch of groups known as *cylinder groups* (some modern file systems, like Linux ext2 and ext3, just call them *block groups*). We can thus imagine a disk with 8 cylinder groups as follows:

Group0 | Group1 | Group2 | Group3 | Group4 | Group5 | Group6 | Group7

Each group looks like this:

Super Block | Inode Bitmap | Data Bitmap | Inodes | Data blocks

We now describe the components of a cylinder group. A copy of the super block is found in each group for reliability reasons (e.g., if one gets corrupted or scratched, you can still mount and access the file system by using one of the others). The inode and data bitmaps track whether each inode or data block is free, respectively. Bitmaps are an excellent way to manage free space in a file system because it is easy to find a large chunk of free space and allocate it to a file, perhaps avoiding some of the fragmentation problems of the free list in the old UNIX file system. Finally, the inode and data block regions are just like in the previous file system.

[POLICIES: ALLOCATING DIRECTORIES, FILES]

With this group structure in place, FFS now has to decide how to place files and directories and associated metadata on disk to improve performance. The basic mantra is simple: *keep related stuff together* (and its corollary, keep unrelated stuff far apart). Thus, to obey the mantra, FFS has to decide what is "related" and place it within the same block group; conversely, unrelated items should be placed into different block groups. To achieve this end, FFS makes use of a few simple placement heuristics.

The first is the placement of directories. FFS employs a simple approach: find the cylinder group with a low number of allocated directories (because we want to balance directories across groups) and a high number of free inodes (because we want to subsequently be able to allocate a bunch of files), and put the directory data and inode in that group. Of course, other heuristics could be used here (e.g., taking into account the number of free data blocks); a rough code sketch of this directory-placement approach appears at the end of this section.

For files, FFS does two things. First, it makes sure (in the general case) to allocate the data blocks of a file in the same group as its inode, thus preventing long seeks between inode and data (as in the old file system). Second, it places all files that are in the same directory in the cylinder group of the directory they are in. Thus, if a user creates four files, "/dir1/1.txt", "/dir1/2.txt", "/dir1/3.txt", and "/dir99/4.txt", FFS would try to place the first three near one another (same group) and the fourth far away (in some other group).

It should be noted that these heuristics are not based on extensive studies of file system traffic or anything particularly nuanced; rather, they are based on good old-fashioned common sense (isn't that what CS stands for, after all?). Files in a directory *are* often accessed together (imagine compiling a bunch of files and then linking them into a single executable). Because they are, FFS will often improve performance, making sure that seeks between related files are short.
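As a rough sketch of the directory-placement heuristic described above (the structures and the scoring function here are invented for illustration; the real FFS code differs in its details), one might choose a cylinder group for a new directory like this:

#include <limits.h>

/* Hypothetical per-group summary information; real FFS keeps similar
 * counts in each cylinder group's on-disk header. */
struct group_summary {
    int num_dirs;       /* directories already allocated in this group */
    int free_inodes;    /* inodes still free in this group */
};

/* Pick a cylinder group for a new directory: prefer groups with few
 * existing directories and many free inodes, per the FFS heuristic. */
int pick_group_for_dir(struct group_summary *groups, int ngroups)
{
    int best = -1;
    long best_score = LONG_MIN;
    for (int i = 0; i < ngroups; i++) {
        /* One simple (made-up) scoring function: reward free inodes,
         * penalize groups that already hold many directories. */
        long score = (long)groups[i].free_inodes - 100L * groups[i].num_dirs;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}

A new file's inode is then simply placed in the same group as its parent directory, and its data blocks in the same group as its inode, which is how the "keep related stuff together" mantra gets carried out.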
[THE LARGE-FILE EXCEPTION]

There is one important exception to the general policy of file placement, and it happens for large files. Without a different rule, a large file would entirely fill its first block group (and maybe others). Filling a block group in this manner is undesirable for the following reason: it prevents subsequent "related" files from being placed within this block group, and thus may hurt subsequent file-access locality.

Thus, for large files, FFS does the following. After some number of blocks are allocated into the first block group (e.g., 12 blocks), FFS places the next "large" chunk of the file in another block group (perhaps chosen for its low utilization). Then, the next chunk of the file is placed in yet another block group, and so on. Thus, instead of a single large file (with chunks 0 through 7) filling up a block group like this:

Group0   | Group1 | Group2 | Group3 | Group4 | Group5 | Group6 | Group7
01234567

we instead get the file spread across the disk in chunks:

Group0 | Group1 | Group2 | Group3 | Group4 | Group5 | Group6 | Group7
01                23                45                67

The astute reader will note that this might hurt performance, particularly in the relatively common case of sequential file access (e.g., when a user or application reads chunks 0 through 7 in order). And you are right! It will. But we can help this a little bit, by choosing our chunk size carefully. Specifically, if the chunk size is large enough, we will still spend most of our time transferring data from disk and only a relatively small amount of time seeking between chunks of the file. This process of reducing an overhead by doing more work per overhead paid is called *amortization* and is a common technique in computer systems.

Let's do an example: assume that the average seek time for a disk is 10 ms. Assume further that the disk transfers data at 40 MB/s. If our goal was to spend half our time seeking between chunks and half our time transferring data (and thus achieve 50% of peak disk performance), we would thus need to spend 10 ms transferring data for every 10 ms seek. So the question becomes: how big does a chunk have to be in order to spend 10 ms in transfer? Easy, just use our old friend, math:

   40 MB     1 second
  -------- * -------- * 10 ms = 409 KB
   second    1000 ms

(Hint: remember that a MB is 1024 KB, not 1000 KB; that is why you get about 409 KB and not exactly 400.)

Basically, what this equation says is this: if you transfer data at 40 MB/s, you need to transfer only about 409 KB every time you seek in order to spend half your time seeking and half your time transferring. Similarly, you can compute the size of the chunk you would need to achieve 90% of peak bandwidth (it turns out to be about 3.69 MB), or even 99% of peak bandwidth (40.6 MB!). As you can see, the closer you want to get to peak, the bigger these chunks get (much bigger, in fact!). And it gets worse over time: transfer rates improve quickly, while seek times improve only slowly. See if you can figure out why.
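The same arithmetic generalizes: to spend a fraction p of the total time transferring, each transfer must take p/(1-p) times as long as a seek. The little throwaway program below (using only the example parameters from above, a 10-ms average seek and a 40-MB/s transfer rate) reproduces the numbers in the text:

#include <stdio.h>

int main(void)
{
    double seek_ms = 10.0;                            /* average seek time */
    double rate_kb_per_ms = 40.0 * 1024.0 / 1000.0;   /* 40 MB/s, in KB per ms */
    double targets[] = { 0.50, 0.90, 0.99 };          /* fraction of peak bandwidth */

    for (int i = 0; i < 3; i++) {
        double p = targets[i];
        /* To spend fraction p of total time transferring, the transfer
         * must take p/(1-p) times as long as the seek. */
        double transfer_ms = seek_ms * p / (1.0 - p);
        double chunk_kb = rate_kb_per_ms * transfer_ms;
        printf("%2.0f%% of peak: transfer %6.0f ms per seek, chunk ~%8.1f KB\n",
               p * 100, transfer_ms, chunk_kb);
    }
    return 0;
}

Running it prints chunk sizes of roughly 409.6 KB, 3686.4 KB, and 40550.4 KB for the 50%, 90%, and 99% targets, respectively; the 3.69-MB and 40.6-MB figures quoted above correspond to these KB values divided by 1000.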
[A FEW OTHER THINGS ABOUT FFS]

FFS did a couple of other neat things too. In particular, the designers were also extremely worried about accommodating small files; as it turned out, many files were 2 KB or so in size back then, and using 4-KB blocks, while good for transferring data, was not so good for space efficiency. Internal fragmentation could thus lead to roughly half the disk being wasted for a typical file system.

The solution the FFS designers hit upon was simple and solved the problem quite nicely. They decided to introduce "subblocks", which were 512-byte little blocks that the file system could allocate to files. Thus, if you created a small file (say 1 KB in size), it would occupy two subblocks and thus not waste an entire 4-KB block. As the file grew, the file system would continue allocating 512-byte blocks to it until it accumulated a full 4 KB of data. At that point, FFS would find a 4-KB block, copy the subblocks into it, and free the subblocks for future use.

You might note that this is inefficient, requiring a lot of extra work from the file system. And you'd be right again! Thus, FFS would generally avoid this extra work for larger files by always writing to disk in multiples of 4 KB, thus avoiding the subblock specialization entirely.

A second neat thing FFS introduced was a disk layout that was optimized for performance. In those days (before SCSI), disks were much less sophisticated and required the host CPU to control their operation in a much more hands-on way. A problem arose in FFS when a file was placed on consecutive sectors of the disk, as in the following layout (blocks listed in the order they appear around the track):

0 1 2 3 4 5 6 7 8 9 a b

In particular, FFS would first issue a read to block 0. By the time the read was complete, and FFS was about to issue a read to block 1, it was too late: block 1 had already rotated under the head, and now the read to block 1 would incur a full rotation. FFS solved this problem with a different layout (again listed in the order the blocks appear around the track):

0 6 1 7 2 8 3 9 4 a 5 b

By skipping over every other block (in this example), FFS had enough time to request the next block before it rotated past the disk head. In fact, FFS was smart enough to figure out, for a particular disk, how many blocks it should skip when doing layout in order to avoid the extra rotations; this was called "parameterization", as FFS would figure out the disk's performance parameters and use those to decide on the exact staggered layout scheme.

Now you might be thinking: this isn't so great after all. In fact, you will only get 50% of peak bandwidth with this type of layout, because you have to go around each track twice just to read each block once. Because of this, modern disks are much smarter: they internally read the entire track in and buffer it in an internal disk cache (often called a "track buffer" for this very reason). Then, on subsequent reads to the track, the disk will just return the desired data from its cache. File systems thus no longer have to worry about these incredibly low-level details. Abstraction and higher-level interfaces can be a good thing!

[1] Marshall K. McKusick, William N. Joy, Sam J. Leffler, Robert S. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Volume 2, Number 3, August 1984, pages 181-197.