** Log-structured File Systems **

In the early 1990s, a group at Berkeley led by Professor John Ousterhout and graduate student Mendel Rosenblum developed a new file system known as the log-structured file system [1]. Their motivation to do so was based on the following observations:

*Memory sizes were growing*: As memory gets bigger, more data can be cached in memory. As more data is cached, disk traffic increasingly consists of writes, since reads are serviced by the cache. Thus, file system performance is largely determined by write performance.

*There was a large and growing gap between random I/O performance and sequential I/O performance*: Transfer bandwidth increases roughly 50-100% every year; seek and rotational delay costs decrease much more slowly, perhaps 5-10% per year [2]. Thus, if one is able to use the disk in a sequential manner, one gains a huge performance advantage, which grows over time.

*Existing file systems perform poorly on many common workloads*: For example, FFS [3] performs a large number of writes to create a new file of size one block: one for the new inode, one to update the inode bitmap, one to the directory data block that the file is in, one to the directory inode to update it, one to the new data block that is a part of the new file, and one to the data bitmap to mark the data block as allocated. Thus, although FFS places all of these blocks within the same block group, it incurs many short seeks and subsequent rotational delays, and performance falls far short of peak sequential bandwidth.

An ideal file system would thus focus on write performance, and try to make use of the sequential bandwidth of the disk. Further, it would perform well on common workloads that not only write out data but also update on-disk metadata structures frequently.

The new type of file system Rosenblum and Ousterhout introduced was called *LFS*, short for the *Log-structured File System*. When writing to disk, LFS first buffers all updates (including metadata!) in an in-memory *segment*; when the segment is full, it is written to disk in one long, sequential transfer to an unused part of the disk (i.e., LFS never overwrites existing data, but rather *always* writes segments to a free part of the disk). Because segments are large, the disk is used quite efficiently, and thus performance of the file system approaches the peak performance of the disk.

[WRITING TO DISK: SOME DETAILS]

Let's try to understand this a little better through an example. Imagine we are appending a new block to a file; assume the file already exists but currently has no blocks allocated to it (it is zero sized). To do so, LFS of course places the data block D in its in-memory segment:

------------------------------------------------------------------
| D |
------------------------------------------------------------------

However, we must also update the inode to point to the new block. Because LFS wants to make all writes sequential, it must include the inode I in the update to disk as well. Thus, the segment (still in memory) now looks like this:

------------------------------------------------------------------
| D | I |
------------------------------------------------------------------

Note further that I is updated to point to D (and also note that the pointer within I is a disk address; thus, when placing I in the segment, LFS must already have an idea of where this segment will be written to disk).
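To make this buffering concrete, here is a minimal sketch in C of how a data block and the updated inode might be appended to an in-memory segment before being flushed in one sequential transfer. The structure layouts, sizes, and names (struct segment, seg_append, and so on) are illustrative assumptions, not LFS's actual formats.

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define SEG_SIZE   (1024 * 1024)      /* 1 MB segment; an assumed size        */

/* A hypothetical in-memory segment: filled sequentially, written to disk in */
/* one long transfer once it is full.                                        */
struct segment {
    uint64_t disk_addr;               /* where this segment will land on disk */
    size_t   used;                    /* bytes appended so far                */
    char     buf[SEG_SIZE];
};

/* A toy inode; the real structure has many more fields.                     */
struct inode {
    uint64_t direct[12];              /* direct pointers (disk addresses)     */
    uint64_t size;                    /* file size in bytes                   */
};

/* Append an item to the segment; return the disk address it will occupy.    */
static uint64_t seg_append(struct segment *s, const void *item, size_t len) {
    assert(s->used + len <= SEG_SIZE);
    uint64_t addr = s->disk_addr + s->used;  /* known before the write occurs */
    memcpy(s->buf + s->used, item, len);
    s->used += len;
    return addr;
}

/* Append one new data block D to a file and record the change in its inode I. */
void append_block_to_file(struct segment *s, struct inode *ino,
                          const char data[BLOCK_SIZE]) {
    /* 1. Place the data block D in the in-memory segment.                   */
    uint64_t d_addr = seg_append(s, data, BLOCK_SIZE);

    /* 2. Update the inode I to point to D; this pointer is a disk address,  */
    /*    which we can compute because the segment's destination is known.   */
    ino->direct[ino->size / BLOCK_SIZE] = d_addr;
    ino->size += BLOCK_SIZE;

    /* 3. Place the updated inode I in the segment as well, so that both D   */
    /*    and I reach disk in the same sequential write.                     */
    seg_append(s, ino, sizeof(*ino));
}

The detail worth noticing is that seg_append can hand back a disk address before anything is written: because the segment's destination on disk is chosen up front, LFS knows where D will live and can store that address in I.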
Assume this type of activity continues and the segment finally fills up and is written to disk. So far, so good. We have now written out I and D to disk, and the write to disk was efficient. Unfortunately, we have our first real problem: how can we find the inode I?

[CRUX OF THE PROBLEM: HOW TO FIND INODES IF WE WRITE THEM ALL OVER THE DISK?]

To understand how we find an inode in LFS, let us first make sure we understand how to find an inode in a typical UNIX file system. In a typical file system such as FFS, or even the old UNIX file system, finding inodes is really easy: they are organized in an array and placed on disk at a fixed location (or locations). For example, the old UNIX file system keeps all inodes in a fixed portion of the disk. Thus, given an inode number and the start address, to find a particular inode you can calculate its exact disk address simply by multiplying the inode number by the size of an inode and adding that to the start address of the on-disk array. Here is what this looks like on disk:

Super Block | Inodes | Data blocks

If we expand this a bit, and assume a single block for the super block and ten blocks for inodes, we get:

Super Block | Inodes                                            | Data blocks
    b0      |   b1   b2   b3   b4   b5   b6   b7   b8   b9  b10 | b11 ...

Imagine that each inode block is of size 512 bytes, that each inode is of size 128 bytes, and that inodes are numbered from 0 to 39. We thus get this picture, with 4 inodes (i.e., 512/128) in each block:

Super Block | Inodes                                            | Data blocks
    b0      |   b1   b2   b3   b4   b5   b6   b7   b8   b9  b10 | b11 ...
            |    0    4    8   12   16   20   24   28   32   36 |
            |    1    5    9   13   17   21   25   29   33   37 |
            |    2    6   10   14   18   22   26   30   34   38 |
            |    3    7   11   15   19   23   27   31   35   39 |

Thus, to find inode 11, we first divide 11 by 4 and get 2 (using integer division); thus inode 11 is in the third block of inodes (block 0 of the inode array being the first). Because the inode array starts at block address b1, we add 2 to 1 and get block 3 (b3). Then we compute 11 mod 4 to find which inode within that block (index 3, again starting at 0). It is that simple.

Finding an inode given an inode number in FFS is only slightly more complicated; FFS splits the array into chunks and places a group of inodes within each cylinder group. Thus, one must know how big each chunk of inodes is and the start address of each. After that, the calculations are similar and also easy.

In LFS, life is more difficult. Why? Well, we've managed to scatter the inodes all throughout the disk! Worse, we never overwrite in place, and thus the latest version of an inode (i.e., the one we want) keeps moving.

[SOLUTION: THE INODE MAP]

To remedy this, the designers of LFS introduced a *level of indirection* between inode numbers and the inodes through a data structure called the *inode map* (*imap* for short). The imap is a structure that takes an inode number as input and produces the disk address of the most recent version of the inode. Thus, you can imagine it would often be implemented as a simple array, with 4 bytes (a disk pointer) per entry. Any time an inode is written to disk, the imap is updated with its new location.
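As a rough illustration, the sketch below contrasts the two lookup schemes just described: the old UNIX file system computes an inode's location from its number and a fixed array start, while LFS adds a level of indirection through the imap. The constants mirror the example above; the function names and types are invented for illustration.

#include <stdint.h>

#define BLOCK_SIZE        512   /* inode block size used in the example above */
#define INODE_SIZE        128
#define INODES_PER_BLOCK  (BLOCK_SIZE / INODE_SIZE)   /* 4 inodes per block   */
#define INODE_ARRAY_START 1     /* the inode array begins at block b1         */
#define MAX_INODES        40

/* Old UNIX FS style: an inode's location is computed from its number.        */
struct inode_loc {
    uint32_t block;             /* disk block that holds the inode            */
    uint32_t offset;            /* byte offset of the inode within that block */
};

struct inode_loc inode_addr_fixed(uint32_t inumber) {
    struct inode_loc loc;
    loc.block  = INODE_ARRAY_START + inumber / INODES_PER_BLOCK;
    loc.offset = (inumber % INODES_PER_BLOCK) * INODE_SIZE;
    return loc;                 /* e.g., inode 11 -> block b3, offset 384     */
}

/* LFS style: a level of indirection. Each imap entry is a 4-byte disk        */
/* pointer giving the address of the most recent version of that inode.       */
static uint32_t imap[MAX_INODES];

uint32_t inode_addr_lfs(uint32_t inumber) {
    return imap[inumber];       /* just an array lookup                       */
}

void imap_update(uint32_t inumber, uint32_t new_disk_addr) {
    imap[inumber] = new_disk_addr;   /* done every time the inode is written  */
}

The indirection is what lets the latest version of an inode move freely around the disk; the cost, taken up next, is that the imap itself must now be stored somewhere and kept up to date.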
[ANOTHER PROBLEM: WHERE TO PUT THE INODE MAP?]

The imap, unfortunately, needs to be kept persistent (i.e., written to disk); doing so allows LFS to keep track of the locations of inodes across crashes, and thus operate as desired. Thus, a question: where should the imap live? It could live on a fixed part of the disk, of course. Unfortunately, because it is updated frequently, this would require every segment write to be followed by a write to the imap's fixed location, and performance would suffer (i.e., there would be more disk seeks, between each segment and the fixed location of the imap).

To remedy this, LFS places pieces of the inode map into the current segment as well. Thus, when writing a data block to disk (as above), the segment might actually look like this:

------------------------------------------------------------------
| D | I | imap(I) |
------------------------------------------------------------------

where imap(I) is the piece of the inode map that tells us where inode I is on disk. Note that imap(I) will also include the mapping information for some other inodes that are near inode I in the imap.

The clever reader might have noticed a problem here: how do we find the inode map, now that pieces of it are also spread across the disk? In the end, LFS does keep one fixed place on disk for this purpose, known as the *checkpoint region*. The checkpoint region contains pointers to the latest pieces of the inode map, and thus the inode map pieces can be found. Note that the checkpoint region is only updated periodically (say, every 30 seconds or so), and thus performance is not adversely affected.

[READING A FILE FROM DISK]

To make sure you understand what is going on, let's now read a file from disk. Assume we have nothing in memory to begin with. The first on-disk data structure we must read is the checkpoint region; it contains pointers (i.e., disk addresses) to the pieces of the inode map, so LFS reads in the entire inode map and caches it in memory. After this point, when given the inode number of a file, LFS simply looks up the inode-number-to-inode-disk-address mapping in the imap and reads in the most recent version of the inode. To read a block from the file, LFS then proceeds exactly as a typical UNIX file system does, using direct pointers, indirect pointers, or doubly-indirect pointers as need be. Thus, in the common case, LFS should perform the same number of I/Os as a typical file system when reading a file from disk; the entire imap is cached, so the only extra work LFS does during a read is looking up the inode's address in the imap.
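The read path just described can be summarized with the following hedged sketch in C; the structure layouts and the disk_read helper are stand-ins for illustration, and only direct pointers are shown.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE        4096
#define IMAP_PIECES       16
#define ENTRIES_PER_PIECE 4
#define MAX_INODES        (IMAP_PIECES * ENTRIES_PER_PIECE)

/* Illustrative structures only; the real on-disk formats are more involved. */
struct checkpoint_region { uint64_t imap_piece_addr[IMAP_PIECES]; };
struct inode             { uint64_t direct[12]; uint64_t size; };

static uint64_t imap_cache[MAX_INODES];  /* the entire imap, cached in memory */

/* Stand-in for device I/O: a real implementation would read len bytes from  */
/* disk address addr into buf.                                               */
static void disk_read(uint64_t addr, void *buf, size_t len) {
    (void)addr;
    memset(buf, 0, len);
}

/* Done once, with nothing yet in memory: read the checkpoint region, then   */
/* follow its pointers to pull every piece of the inode map into memory.     */
void lfs_mount(uint64_t checkpoint_addr) {
    struct checkpoint_region cr;
    disk_read(checkpoint_addr, &cr, sizeof(cr));
    for (int i = 0; i < IMAP_PIECES; i++)
        disk_read(cr.imap_piece_addr[i],
                  &imap_cache[i * ENTRIES_PER_PIECE],
                  ENTRIES_PER_PIECE * sizeof(uint64_t));
}

/* Reading one block of a file: a single imap lookup finds the inode; from   */
/* there, LFS proceeds like a typical UNIX file system (only direct pointers */
/* are shown here).                                                          */
void lfs_read_block(uint32_t inumber, uint64_t block_index, void *buf) {
    struct inode ino;
    disk_read(imap_cache[inumber], &ino, sizeof(ino));    /* find the inode  */
    disk_read(ino.direct[block_index], buf, BLOCK_SIZE);  /* then read data  */
}

The work in lfs_mount is paid once; after the imap is cached, each read costs the same number of I/Os as in a typical UNIX file system, plus only an in-memory imap lookup.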
[A NEW PROBLEM: GARBAGE COLLECTION]

You may have noticed another problem with LFS: it keeps writing newer versions of a file, its inode, and indeed all data to new parts of the disk. This process, while keeping writes efficient, implies that LFS leaves older versions of a file all over the disk, scattered throughout a number of older segments. One could keep those older versions around and allow users to restore old file versions (for example, when they accidentally overwrite or delete a file, it could be quite handy to do so); such a file system is known as a *versioning* file system because it keeps track of the different versions of a file.

However, LFS instead keeps only the latest live version of a file; thus (in the background), LFS must periodically find these old dead versions of file data, inodes, and other structures, and *clean* them; cleaning makes blocks on disk free again for use in subsequent segment writes. Note that the process of cleaning is a form of *garbage collection*, a technique that arises in programming languages that automatically free unused memory for programs.

The basic LFS cleaning process works as follows. Periodically, the LFS cleaner reads in a number of old (partially-used) segments, determines which blocks are live within those segments, and then writes out a new set of segments containing just the live blocks. Specifically, we expect the cleaner to read in M existing segments, compact their contents into N new segments (where N < M), and then write the N segments to disk in new locations; the old M segments are then freed and can be used by the file system for subsequent writes.
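As a rough sketch of this compaction step (not LFS's actual cleaner), the loop below copies live blocks from M old segments into new segments. It assumes helpers that can tell whether a block is live and that write and free segments; how LFS actually makes those determinations is not shown here.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE     4096
#define BLOCKS_PER_SEG 16          /* kept small for illustration             */

struct block   { uint64_t addr; char data[BLOCK_SIZE]; };
struct segment { struct block blocks[BLOCKS_PER_SEG]; size_t nblocks; };

/* Assumed helpers: liveness checking, segment output, and segment freeing    */
/* are left abstract in this sketch.                                          */
extern bool block_is_live(const struct block *b);
extern void write_segment_to_disk(const struct segment *s);
extern void mark_segment_free(int segno);

/* Read in M old (partially-used) segments, copy only their live blocks into  */
/* new segments, write those out, and then free the old segments for reuse.   */
void clean(struct segment *old_segs, int M) {
    struct segment out = { .nblocks = 0 };

    for (int i = 0; i < M; i++) {
        for (size_t j = 0; j < old_segs[i].nblocks; j++) {
            if (!block_is_live(&old_segs[i].blocks[j]))
                continue;                        /* dead block: drop it       */
            out.blocks[out.nblocks++] = old_segs[i].blocks[j];
            if (out.nblocks == BLOCKS_PER_SEG) { /* output segment is full    */
                write_segment_to_disk(&out);
                out.nblocks = 0;
            }
        }
    }
    if (out.nblocks > 0)                         /* flush any partial segment */
        write_segment_to_disk(&out);

    for (int i = 0; i < M; i++)                  /* old segments now reusable */
        mark_segment_free(i);
}

Once the live blocks have been rewritten, the M old segments contain nothing of value and can be handed back as free space for future segment writes.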