** Journaling **

Let's say we are trying to append a block to an existing file. For simplicity, let's assume we are using an FFS-like file system. Before we do this write, the file is on disk in the form of an inode, one (or more) existing data blocks, and some bitmaps that mark the inode and data blocks as in-use. This might look something like this (on a tiny file system):

  inode bitmap || data bitmap || inodes                          || data blocks
  0 1 0 0 0 0  || 0 0 0 0 1 0 || FREE [Iv1] FREE FREE FREE FREE || FREE FREE FREE FREE [D1] FREE

Inside the first version of the inode (Iv1), it looks like this:

  owner       : remzi
  permissions : read-only
  size        : 1
  pointer     : 4
  pointer     : null
  pointer     : null
  pointer     : null

In this inode, the 'size' of the file is 1 (it has one block allocated), the first direct pointer points to block 4 (the first data block of the file, D1), and all three other direct pointers are set to null (indicating that they are not used). Of course, real inodes have many more fields.

Inside the data bitmap (B1), we have a bit indicating that data block 4 is in use. And finally, of course, we see that disk block 4 holds the contents of the first block of the file, D1.

When we append to the file, we are adding a new data block to it, and thus must update three on-disk structures: the inode (which must now contain a pointer to the new block as well as an updated size field to reflect the new size of the file), the new data block D2, and a new version of the data bitmap to indicate that the new data block has been allocated. Thus, in the memory of the system, we have three blocks which we must write to disk. The updated inode (Iv2) now looks like this:

  owner       : remzi
  permissions : read-only
  size        : 2
  pointer     : 4
  pointer     : 5
  pointer     : null
  pointer     : null

The updated data bitmap (B2) now looks like this: 0 0 0 0 1 1. And finally, there is the data block itself (D2).
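To make these three in-memory updates concrete, here is a toy sketch in Python. The dictionary- and list-based structures are purely illustrative (no real file system looks like this); the point is that the append dirties exactly three blocks: the inode, the data bitmap, and the new data block.

```python
def append_block(inode, data_bitmap, new_data, block_num):
    """Return the new inode (Iv2) and bitmap (B2), plus the pending data write."""
    iv2 = dict(inode)                         # copy Iv1 -> Iv2
    iv2["pointers"] = inode["pointers"][:]
    iv2["pointers"][iv2["size"]] = block_num  # next free direct pointer -> block 5
    iv2["size"] += 1                          # file now spans two blocks
    b2 = data_bitmap[:]
    b2[block_num] = 1                         # mark block 5 as allocated
    return iv2, b2, (block_num, new_data)     # the three dirty blocks to write

iv1 = {"owner": "remzi", "perms": "read-only", "size": 1,
       "pointers": [4, None, None, None]}
b1 = [0, 0, 0, 0, 1, 0]

iv2, b2, (blk, data) = append_block(iv1, b1, "D2", 5)
# iv2 has size 2 and pointers [4, 5, None, None]; b2 is [0, 0, 0, 0, 1, 1]
```

Note that nothing here has touched "disk" yet; the crux of the problem below is how to get all three of these blocks onto the disk safely.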
What we would like is for the final on-disk image of the file system to look like this:

  inode bitmap || data bitmap || inodes                          || data blocks
  0 1 0 0 0 0  || 0 0 0 0 1 1 || FREE [Iv2] FREE FREE FREE FREE || FREE FREE FREE FREE [D1] [D2]

and thus we must perform three separate writes to the disk, one each for the inode (Iv2), the bitmap (B2), and the data block (D2).

[THE CRUX OF THE PROBLEM]

But now we have a problem: the system may crash or lose power between any two of these writes, and thus the on-disk state may be only partially updated. After the crash, the system boots and wishes to mount the file system again (in order to access files and such). But first, we must make sure the file system is in a reasonable state.

[EXAMPLES]

To understand this better, let's look at some example crash scenarios. Imagine only a single write succeeds. There are three possible cases:

1,1) Just the data block (D2) is written to disk. In this case, the data is on disk, but there is no inode that points to it and no bitmap that even says the block is allocated. Thus, it is as if the write never occurred.

1,2) Just the updated inode (Iv2) is written to disk. In this case, the inode points to the disk address (5) where D2 was about to be written, but D2 has not yet been written there. Thus, if we trust that pointer, we will read *garbage* data from the disk (the old contents of disk address 5). Further, we have a new problem, which we call a *file system inconsistency*. The on-disk bitmap is telling us that data block 5 has not been allocated, but the inode is saying that it has. This disagreement in the file system data structures is an inconsistency, and to use the file system, we must somehow fix it (more on that below).

1,3) Just the updated bitmap (B2) is written to disk. In this case, the bitmap indicates that block 5 is allocated, but there is no inode that points to it.
Thus the file system is *inconsistent* again, but there is little we can do: we have no idea which file this block should have been a part of.

There are also three more crash scenarios in this attempt to write three blocks to disk, where two writes succeed and the last one fails:

2,1) The inode (Iv2) and bitmap (B2) are written to disk, but not the data (D2). In this case, the file system is completely consistent: the inode has a pointer to block 5, the bitmap indicates that 5 is in use, and thus everything looks OK from the perspective of the file system's metadata. But there is one problem: block 5 again has *garbage* in it.

2,2) The inode (Iv2) and the data block (D2) are written, but not the bitmap (B2). In this case, we have the inode pointing to the correct data on disk, but again an *inconsistency* between the inode and the old version of the bitmap (B1). Thus, we must fix it, either by removing the block from the inode (and thus losing data) or by simply trusting the inode and updating the bitmap to indicate that data block 5 is in use.

2,3) The bitmap (B2) and data block (D2) are written, but not the inode (Iv2). In this case, we again have an *inconsistency* between the inode and the data bitmap, but there is only one way to fix it: by clearing the bitmap (because no inode points to that data block, and thus we have no idea which file D2 is a part of). Thus, even though the data block was written to disk, it is lost.

[THE CONSISTENT UPDATE PROBLEM]

Hopefully, from these crash scenarios, you can see the many problems that crashes can cause in our on-disk file system image. What we'd ideally like to do is move the file system from one consistent state (e.g., before the file got appended to) to another (e.g., after the inode, bitmap, and new data block have been written to disk), but we can't do this easily because the disk only commits one write at a time, and crashes may occur between these updates.
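The crash scenarios above can be enumerated mechanically. The following Python snippet is a thought experiment, not real file-system code: it classifies the on-disk state for every subset of the three writes that might survive a crash, using the same labels as the discussion above.

```python
from itertools import combinations

def classify(written):
    """Classify the on-disk state given which of the three writes survived."""
    inode, bitmap, data = ("inode" in written, "bitmap" in written, "data" in written)
    if inode != bitmap:
        return "inconsistent: inode and bitmap disagree"
    if inode and not data:
        return "consistent metadata, but inode points to garbage"
    if data and not inode:
        return "fine: as if the write never occurred"
    return "fine"

# Enumerate all subsets: zero writes, one write (cases 1,1-1,3),
# two writes (cases 2,1-2,3), and all three writes (success).
for n in range(4):
    for subset in combinations(["inode", "bitmap", "data"], n):
        print(sorted(subset), "->", classify(set(subset)))
```

Note that only the metadata disagreement counts as an *inconsistency*; the case where everything is consistent but the inode points to garbage (2,1) is invisible to any tool that checks metadata alone.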
We call this general problem the *consistent update problem*.

[SOME SOLUTIONS]

Early file systems took a simple approach to the consistent update problem. Basically, they decided to let inconsistencies happen and then fix them later (when rebooting). A classic example of a tool that does this is fsck, a Unix tool for finding such inconsistencies and repairing them. Note that such an approach can't fix all problems (e.g., scenario 2,1 above, where the file system looks consistent but the inode points to garbage data); the only goal is to make sure the file system metadata is internally consistent.

However, fsck (and similar approaches) have a fundamental problem: they are *too slow*. With a very large disk volume, just reading the entire disk may take many minutes or even hours. Worse, it seems kind of crazy. Consider our example above, with just a few blocks being written to the disk; what a waste to scan the entire disk just to see if one of those three writes didn't finish! Thus, as disks (and multi-disk RAID systems) grew in size, people started to look for other solutions.

[JOURNALING]

Probably the most popular solution to the consistent update problem is to steal an idea from the world of database management systems. That idea, known as *write-ahead logging*, was invented to address exactly this type of problem. In file systems, we sometimes call write-ahead logging *journaling* for historical reasons.

The basic idea is as follows. When updating the disk, before over-writing the structures in place, first write down a little note (somewhere else on the disk) describing what you are about to do. Writing this little note is the "write ahead" part, and we write it to a structure we call the "log"; hence, "write-ahead logging".
By writing the note to disk, you are guaranteeing that if a crash takes place during the update (overwrite) of the structures, you can go back and look at the note you made and try again; thus, you will know exactly what to fix (and how to fix it) after a crash, instead of having to scan the entire disk.

[SIMPLE JOURNALING: AN EXAMPLE]

Let's do a simple example. Say we have our canonical update again, where we wish to write the inode (Iv2), bitmap (B2), and data block (D2) to disk. Before writing them to their final disk locations, we are now first going to write them to the log (a.k.a. journal). This is what this will look like in the log:

  TxBegin | Iv2 | B2 | D2 | TxEnd

You can see we have written five blocks here. The transaction begin block (TxBegin) tells us about this update, including information about the pending update to the file system and some kind of transaction identifier (TID). The middle three blocks just contain the exact contents of the blocks themselves; this is known as *physical* logging, as we are putting the exact physical contents of the update in the journal. (An alternate idea is to put a more compact *logical* representation of the update in the journal, e.g., "this update wishes to append data block D2 to file X", which is a little more complex but can save space in the log and perhaps improve performance.) The final block (TxEnd) is a marker of the end of this transaction, and will also contain the TID.

Once this transaction is safely on disk, we are ready to overwrite the old structures in the file system; thus, we issue the writes of Iv2, B2, and D2 to their final disk locations as seen above. If these writes complete successfully, we are basically done (though at some point we should free those log entries so that we can use them again later).
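This journal-then-overwrite protocol can be modeled in a few lines of Python. This is a toy sketch, with a dict standing in for the disk and all names hypothetical; it also includes the recovery step (replaying any transaction that committed but was never freed), which is exactly the "go back and look at the note" idea from above.

```python
disk = {}      # final file-system locations: address -> block contents
journal = []   # the log: one entry per transaction

def write_tx(tid, updates):
    tx = {"tid": tid, "updates": dict(updates), "freed": False}
    journal.append(tx)                 # 1. write the transaction to the journal
    for addr, block in updates.items():
        disk[addr] = block             # 2. overwrite the structures in place
    tx["freed"] = True                 # 3. free the log entry for reuse

def recover():
    # Replay any transaction that committed but was never freed.
    for tx in journal:
        if not tx["freed"]:
            for addr, block in tx["updates"].items():
                disk[addr] = block
            tx["freed"] = True

# Normal operation: the whole update reaches the disk.
write_tx(1, {"inode": "Iv2", "bitmap": "B2", 5: "D2"})

# Crash case: a transaction made it into the journal, but the crash hit
# before its in-place writes happened; recovery replays it.
journal.append({"tid": 2, "updates": {5: "D2"}, "freed": False})
recover()
```

Because the journal holds the full contents of every block, replaying a transaction that was already partially applied is harmless: the same contents are simply written again.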
Thus, the complete sequence is as follows:

  1. Write the transaction (containing Iv2, B2, D2) to the journal.
  2. Write the blocks (Iv2, B2, D2) to the file system proper.
  3. Mark the transaction free in the journal.

Of course, a crash may happen at any time during this sequence. If it happens after the transaction is committed to disk but before (or during) the writes of Iv2, B2, and D2, we can now *recover*. When the system boots, it will scan the log and look for transactions that have committed to disk but have not yet been freed; these transactions are then *replayed*, with the file system again attempting to write Iv2, B2, and D2 to their final on-disk locations. This form of logging is one of the simplest forms there is, and is called *redo* logging. If these writes complete, the recovery code will mark the transaction as free, and the file system will be in a consistent state.

Things get a little trickier when a crash occurs during the writes to the journal. Here, we are trying to write TxBegin | Iv2 | B2 | D2 | TxEnd to disk. One simple way to do this would be to issue the writes one at a time, waiting for each to complete before issuing the next. However, this is slow. Ideally, we'd like to issue all five block writes at once, as this would turn five writes into a single sequential write and thus be faster. However, this is unsafe, for the following reason: given such a big write, the disk internally may perform scheduling and complete small pieces of the big write in any order. Thus, the disk internally may (1) write TxBegin, Iv2, B2, and TxEnd and only later (2) write D2. Unfortunately, if the disk loses power between (1) and (2), this is what we will see on disk:

  TxBegin | Iv2 | B2 | ??? | TxEnd

Why is this a problem? Well, the transaction looks like a valid transaction (it has a begin and an end, after all, and some stuff in between).
Further, the file system can't look at that fourth block and know it is wrong; after all, it is arbitrary user data. Thus, if the system now reboots and runs recovery, it will replay this transaction, and ignorantly copy the contents of the garbage block '???' to the location where D2 is supposed to live.

To avoid this, some journaling systems issue a transactional write in two steps. First, they write all blocks except the TxEnd block to the journal:

  TxBegin | Iv2 | B2 | D2 |

Then, only when those writes complete, do they issue the write of the TxEnd block, thus leaving the journal in this final, safe state:

  TxBegin | Iv2 | B2 | D2 | TxEnd

What we really need to understand here is the atomicity guarantee provided by the disk. It turns out that the disk guarantees that any 512-byte write will either happen or not (and never be half-written); thus, to make sure the write of TxEnd is atomic, one should make it a single 512-byte block.

[SIMPLE JOURNALING: COSTS]

Unfortunately, there are costs to journaling. Although recovery is now fast (scanning the journal and replaying a few transactions, as opposed to scanning the entire disk), normal operation is slower. In particular, for each write to disk, we are now also writing to the journal first, thus doubling write traffic. Further, between writes to the journal and writes to the main file system, there is a costly seek.

Because of these costs, people have tried a few different things in order to speed up performance. For example, the mode of journaling we described above is often called *data journaling* (as in Linux ext3), as it journals all user data (in addition to the metadata of the file system). A simpler (and more common) form of journaling is sometimes called *ordered journaling* (or just *metadata journaling*), and it is nearly the same, except that user data is not journaled.
Thus, when performing the same update as above, the following would be written to the journal:

  TxBegin | Iv2 | B2 | TxEnd

and D2 would simply be written to the file system. This modification does raise an interesting question: when should we write D2 to disk? Here are two possible options.

Option 1:

  1. Write D2 to disk.
  2. Write the transaction (containing Iv2, B2) to the journal.
  3. Write the blocks (Iv2, B2) to the file system proper.
  4. Mark the transaction free in the journal.

Option 2:

  1. Write the transaction (containing Iv2, B2) to the journal.
  2. Write the blocks (Iv2, B2, and D2) to the file system proper.
  3. Mark the transaction free in the journal.

Thus, in ordered journaling, we can either write the data block D2 before the transaction, or after (with the other blocks). Option 2 is simpler and may perform better, but has a problem: it may end up with a consistent file system, but one in which Iv2 points to garbage data. Specifically, if the file system is writing Iv2, B2, and D2 to disk and only manages to complete the first two writes before crashing, D2 will not be on the disk. The file system will then try to recover; because D2 is *not* in the log, it will replay the writes of Iv2 and B2 and produce a consistent file system. However, Iv2 will be pointing to garbage data. Thus, option 1 is attractive, as it guarantees that Iv2 will never point to garbage by forcing the data to disk first.

In real systems, metadata journaling is more popular than full data journaling. For example, Windows NTFS and SGI's XFS both use non-ordered metadata journaling. Linux ext3 gives you the option of choosing either data, ordered, or unordered modes.
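Assuming each write is issued and completed in order, option 1 can be sketched as a simple write trace in Python (a toy model with hypothetical names, not a real implementation):

```python
writes = []  # the order in which writes are issued to the disk

def ordered_append(data_block, metadata_blocks):
    """Option 1: force the data to disk before journaling the metadata."""
    writes.append(("fs", data_block))            # 1. write D2 to its final location
    writes.append(("journal", metadata_blocks))  # 2. journal the transaction (Iv2, B2)
    for m in metadata_blocks:
        writes.append(("fs", m))                 # 3. write Iv2 and B2 in place
    writes.append(("journal-free", None))        # 4. mark the transaction free

ordered_append("D2", ["Iv2", "B2"])
```

The key invariant is visible in the trace: the data write strictly precedes the journal commit of the metadata, so no transaction that recovery might replay can ever leave Iv2 pointing at garbage.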