**THE LOG-STRUCTURED FILE SYSTEM**
==================================

# 0. Take away

- uses a log structure, hence recovery is fast (compared to fsck, which needs to scan the whole disk)
- copy-on-write + garbage collection
- buffers all writes (to data and metadata) in a large segment, and writes the full segment to an unused part of the disk, hence leveraging the disk's sequential bandwidth
- but this raises some challenges:
  + information is scattered on disk; how to keep track of it? ==> imap, checkpoint region (CR)
  + garbage collection ==> segment summary block, version numbers, hot and cold segments
- What is in the segment summary block?
  + for each data block, store its inode number and its offset within the file
  + optimization: version numbers

# 1. Motivation

- CPUs are getting faster ==> the disk is the bottleneck
- Disk: transfer bandwidth increases sharply, but seek time improves slowly (big gap between random I/O and sequential I/O performance)
- Memory is cheaper and bigger, hence larger caches:
  + reads are mostly served by the cache
  + hence, disk traffic is dominated by writes
- Existing file systems perform poorly on common workloads (many small files; e.g. FFS needs a large number of writes to create a new one-block file)
- Existing file systems are not RAID-aware (the small-write problem)

==> The ideal file system:
- focuses on write performance, trying to use the disk's sequential bandwidth
- performs well on common workloads (which not only write data, but also update metadata)
- works well on RAID as well as on a single disk

Problems with existing file systems:
- information is spread around the disk, causing many small accesses (even with FFS)
- writes are synchronous (at least for metadata updates)

# 2. Basic idea of LFS

- borrowed from the database community
- when writing to disk, buffer all updates (data & metadata) into a *segment* (a large in-memory buffer, usually 512 KB, sized to leverage sequential bandwidth)
- when the segment is full, it is written to disk in one long *sequential* transfer to an *unused part* of the disk (note: LFS never overwrites existing data)

Why is this good?
- a *large* segment is written *sequentially* to disk, hence fast
- writes are asynchronous (i.e. buffered in the segment before being flushed to disk)

E.g.: write a data block to a file that already exists but has length zero. We need to write the data block D and also the inode I of that file, so we buffer both updates in the segment:

| D | I ... |

Note that LFS must know where the segment will be placed on disk, so that it can fill in the pointers in I correctly. Once the segment is full, it is written to disk as above.

One more note: LFS views the disk as a set of chunks, or segments. On a write, it picks one free segment on disk to write to. There is no need to guarantee contiguity between segments; that said, two consecutive segment writes may not be contiguous on disk.

Now, some challenges:
- data and metadata are scattered across the disk in the log
- and each piece has multiple versions
- the big problem of *garbage collection*

==> Questions:
  + How to keep track of all of this? More specifically, how to keep track of the inodes?
  + How does a read work?
  + How does recovery work?

# 3. Keeping track of the inodes

- in FFS, inodes can be found at fixed locations
- in LFS, use the imap: a map from inode number to location on disk
- we need to keep the imap persistent:
  + alternative 1: store the imap at a fixed location ==> suffers from seek overhead on every update
  + alternative 2: keep it in the log (LFS does this)

Hence, when writing to a file:

| D | I | Imap ... |

- Then, how do we keep track of the imap?
==> Use the checkpoint region (CR): a fixed location on disk storing pointers to the latest pieces of the imap.
(In fact, there are two checkpoint regions, used alternately, so that a crash in the middle of a checkpoint write does not destroy the only copy. Each checkpoint region contains a timestamp; during crash recovery, LFS chooses the checkpoint region with the latest timestamp.)
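To make sections 2-3 concrete, here is a minimal in-memory sketch of the write and read paths, not LFS's real code: updates are buffered into a segment and flushed to the log in one sequential write, the imap maps inode number to the address of the latest inode copy, and the CR records where the latest imap copy lives. Class, field, and constant names (`LFS`, `SEGMENT_BLOCKS`, ...) are illustrative assumptions.

```python
SEGMENT_BLOCKS = 128          # e.g. 128 blocks ~ one 512 KB segment

class LFS:
    def __init__(self):
        self.log = []             # the "disk": an append-only list of blocks
        self.imap = {}            # inode number -> address of latest inode copy
        self.checkpoint = None    # fixed-location pointer to the latest imap copy
        self.segment = []         # in-memory segment buffer

    def write(self, inum, data):
        # Addresses are known before the flush, so the inode can point at the
        # data block and the imap at the inode (simplification: we ignore the
        # case where the segment fills up between these two appends).
        data_addr  = len(self.log) + len(self.segment)
        inode_addr = data_addr + 1
        self.segment.append(("data", inum, data))            # D
        self.segment.append(("inode", inum, [data_addr]))    # I points at D
        self.imap[inum] = inode_addr                         # imap points at I
        if len(self.segment) >= SEGMENT_BLOCKS:
            self.flush()

    def flush(self):
        # One large sequential write; a copy of the imap piece goes into the
        # log too (appended after the segment here for brevity), and the CR is
        # updated to point at it.
        self.log.extend(self.segment)
        self.log.append(("imap", dict(self.imap)))
        self.checkpoint = len(self.log) - 1
        self.segment = []

    def read(self, inum):
        # Read path: (CR ->) imap -> inode -> data block.
        kind, _, ptrs = self.log[self.imap[inum]]
        assert kind == "inode"
        return self.log[ptrs[0]][2]

# Usage: write a one-block file, force the flush, then read it back.
fs = LFS()
fs.write(1, b"hello")
fs.flush()
print(fs.read(1))    # b'hello'
```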
OK, what happens when a read occurs?
- of course, look up the imap
- if the imap is not in memory, read the CR and fetch the imap (so that subsequent lookups (hopefully) do not need an I/O)
- using the imap, fetch the inode of the file, and follow its pointers

# 4. Problem: Garbage Collection

- data & metadata are scattered on disk, with multiple versions (this could be used to implement a versioning file system)
- but LFS only keeps track of the latest version
- hence, old versions need to be cleaned up:
  + read a set of segments, write out the live blocks into other segments
  + and free the old segments

Hence, the questions:
- mechanism:
  + how to determine which blocks in a segment are alive and which are dead?
- policy:
  + how often should the cleaner run?
  + which segments should it pick?
  + how should the live blocks be written out?

# 4.1 Determining live blocks (mechanism)

- use the *segment summary block* (part of each segment)
  + for each data block, it stores the inode number and the offset within the file
- for a block on disk, to determine whether it is live or dead:
  + read its segment summary block to find the inode number and offset
  + look in the imap to fetch the inode, and get the pointer at that offset
  + if that pointer refers to this block, the block is live; otherwise it is dead
- optimization: *version numbers*
  + for each file, store a version number in the imap and in the on-disk segment
  + when a file is deleted or truncated to zero length, increment its version
  + if the version number on disk is smaller than the version number in the imap, the block is dead (avoids the extra read of the inode)

# 4.2 Policy issues

- When to clean:
  + periodically: e.g. every 30 seconds
  + when the system is idle
  + when free space drops below a threshold
- More difficult: which segments to clean?
  + LFS uses *hot* and *cold* segments (this is just a heuristic)
    > hot segment: a segment containing hot blocks, i.e. blocks whose contents are frequently overwritten
    > cold segment: a segment containing blocks whose data is rarely overwritten

The BASIC IDEA: the *cost-benefit policy* should clean cold segments sooner and hot segments later.
WHY? Because:
- a cold segment may have only a few dead blocks, but the rest of its contents are relatively stable, so the space it frees stays free
- for a hot segment, we should wait longer: as more and more of its blocks get overwritten, it is likely that all of its blocks become dead soon ==> little or no cost to copy out live blocks

benefit / cost = (free space generated * age of data) / cost = (1 - u) * age / (1 + u)

where:
- u = utilization (fraction of live data) in the segment
- cost = 1 + u, because we must read the whole segment (1) and write out the live data (u)
- free space generated = 1 - u

Intuition: choose the segment to clean that gives the most benefit at the least cost.
==> this allows cold segments (which have a higher age) to be cleaned at a much higher utilization than hot segments.

To implement this policy we again need a new data structure, the *segment usage table*. For each segment, the table stores:
- the number of live bytes (used to calculate u)
- the most recent modification time of any block in the segment (used to calculate age)

The segment usage table is written to the log, and the addresses of its blocks are stored in the checkpoint region (which is at a fixed location on disk).
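A small sketch of the cost-benefit selection above, driven by a per-segment usage table entry (live bytes and most recent modification time). The names (`SegmentUsage`, `pick_segment_to_clean`) are illustrative, not LFS's actual interface.

```python
from dataclasses import dataclass

@dataclass
class SegmentUsage:
    live_bytes: int     # number of live bytes in the segment (gives u)
    mod_time: float     # most recent modification time of any block (gives age)

def pick_segment_to_clean(usage_table, segment_bytes, now):
    """Pick the segment with the highest (1 - u) * age / (1 + u)."""
    def score(entry):
        u = entry.live_bytes / segment_bytes   # utilization, 0 <= u <= 1
        age = now - entry.mod_time             # older => assumed colder
        return (1 - u) * age / (1 + u)
    return max(range(len(usage_table)), key=lambda i: score(usage_table[i]))

# Example: a cold segment at 50% utilization beats a hot one at 25%,
# i.e. cold segments get cleaned at much higher utilization than hot ones.
table = [SegmentUsage(live_bytes=256 * 1024, mod_time=0.0),    # cold, u = 0.50
         SegmentUsage(live_bytes=128 * 1024, mod_time=90.0)]   # hot,  u = 0.25
print(pick_segment_to_clean(table, segment_bytes=512 * 1024, now=100.0))  # -> 0
```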
# 5. Crash Recovery

- LFS does not need fsck, hence recovery is fast
- Simple case: just read the CR, and forget about anything that happened after the CR was written (risk of data loss)
- Alternative: roll-forward (a minimal sketch appears at the end of these notes)
  + read the checkpoint region
  + follow the pointers from the CR to the end of the log, and apply any changes that happened since the last checkpoint:
    > read segment summary blocks to recover recently written data
    > if there is a new version of an inode, update the imap
    > if there is a data block without a new copy of its inode, just discard that data block
- Tricky part: consistency between directory entries and inodes
  + a crash can happen after an inode with a new reference count has been written to disk but before the block containing the corresponding directory entry has been written, or vice versa
  + solution: write a *directory operation log* record for each directory change, containing:
    > an operation code
    > the location of the directory entry
    > the contents of the directory entry
    > the new reference count for the inode named in the entry
  + make sure this directory operation log record appears in the log before the changes to the directory block or the inode
==> hence, during recovery, once we see a directory operation log record, just go ahead and replay it (like redo logging)

But this incurs an additional synchronization issue: each checkpoint must represent a consistent state (i.e. the directory operation log must be consistent with the inode and directory blocks in the log)
==> while a checkpoint is being written, directory modifications must be blocked

# 6. Evaluation

(This section should provide a comparison with FFS under various workloads.)

- take: worse than FFS for large sequential reads (e.g. sequentially reading a file whose blocks were written at different times), due to seek overhead
- LFS achieves *temporal locality*: information created or modified at the same time is grouped closely on disk
- FFS achieves *logical locality*: files in the same directory, etc. (it assumes a certain read pattern)
==> FFS pays extra writes to organize information on disk for the assumed read pattern

Take: if *temporal locality* matches *logical locality*, LFS ~ FFS.
For example: if we create a lot of small files, each in a different directory, and then re-read all of them immediately (assuming the files cannot all fit in the cache), then FFS performs worse than LFS.
The key exercise is to come up with workloads that favor LFS and workloads that favor FFS.
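Returning to the roll-forward step of section 5: a minimal sketch, assuming a simple representation where each segment written after the checkpoint carries a summary listing (kind, inode number, disk address) for every block. Only newer inode copies update the imap; data blocks with no newer inode pointing at them are simply ignored (i.e. discarded). The layout and names are assumptions for illustration, not the paper's on-disk format.

```python
def roll_forward(checkpoint_imap, segments_after_checkpoint):
    """Rebuild the imap by replaying segments written after the checkpoint."""
    imap = dict(checkpoint_imap)              # inode number -> inode address
    for segment in segments_after_checkpoint:
        for kind, inum, addr in segment["summary"]:
            if kind == "inode":
                imap[inum] = addr             # newer inode copy: update the imap
            # kind == "data": recovered only if a later inode points at it
    return imap

# Usage: one segment written after the checkpoint contains a new copy of
# inode 7 at address 120; the recovered imap reflects it.
recovered = roll_forward({7: 40}, [{"summary": [("data", 7, 119),
                                                ("inode", 7, 120)]}])
print(recovered)    # {7: 120}
```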