The Design and Implementation of a Log-Structured File System

M. Rosenblum and J. K. Ousterhout @ UC Berkeley

Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, October 13-16, 1991, pages 1-15

Problems of Traditional FS

Applications call read much more often than write, but actual disk traffic is dominated by writes because:

Most reads are served by cache

To keep the file system consistent, metadata, indirect blocks, and directory blocks are often written through synchronously

If you run an NFS server, every write request from a client results in a synchronous write to the server's disk

Writes are also usually slower than reads, because:

The cache in the disk drive reduces read latency

Writes tend to be random, because dirty blocks are written back periodically in no particular disk order

Writing the Log

Gather dirty pages

Calculate disk addresses for each block

Update inode & indirect blocks with the new disk addresses

Package the data blocks, indirect blocks, and metadata (inodes) into a segment

Update the in-memory inode map with the new inode addresses; note that the inode map itself is not included in the segment (it is flushed to the log when a checkpoint is taken)

Write the segment to the disk

Each segment is written to contiguous disk blocks in one large sequential transfer

Segments are chained together by links, so traversing from one segment to the next may require a disk seek (see the sketch after this list)
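A rough sketch of this write path in Python. The Inode/write_segment structures, the list standing in for the disk, and the dict-based inode map are my own illustration, not the paper's actual layout:

    from dataclasses import dataclass, field

    @dataclass
    class Inode:
        number: int
        version: int = 0
        block_addrs: list = field(default_factory=list)   # disk addresses of the file's data blocks

    disk = []        # the log: a flat list of blocks, appended to segment by segment
    inode_map = {}   # in-memory inode map: inode number -> disk address of the inode

    def write_segment(dirty_files):
        """dirty_files: list of (inode, [dirty data blocks]) pairs gathered from the cache."""
        segment = []
        # 1-3. assign new disk addresses to each dirty block and record them in the inode
        for inode, blocks in dirty_files:
            inode.block_addrs = []
            for block in blocks:
                inode.block_addrs.append(len(disk) + len(segment))
                segment.append(block)
        # 4-5. append the inodes themselves and update the in-memory inode map
        #      (the inode map itself is not part of this segment)
        for inode, _ in dirty_files:
            inode_map[inode.number] = len(disk) + len(segment)
            segment.append(inode)
        # 6. one large sequential write of the whole segment
        disk.extend(segment)

    f = Inode(number=7)
    write_segment([(f, [b"hello", b"world"])])
    print(inode_map, f.block_addrs)   # {7: 2} [0, 1]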

Checkpointing the File System

Find a place in the log to write the checkpoint data

Write the inode map and segment usage table to the disk at that location

Update the checkpoint region with the disk addresses just written (sketched below)

The locations of the superblock and the checkpoint region are fixed on disk

Note that at this point the file system is consistent!

Along the path from the checkpoint region through the inode map and inodes to the data blocks, all information is consistent
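A sketch of the checkpoint step on top of the same kind of toy state. The alternating-region logic follows the paper's description of two fixed checkpoint regions, but the concrete names (checkpoint_regions, flush_checkpoint, etc.) are invented for illustration:

    import time

    disk = []                        # the log
    inode_map = {7: 2}               # in-memory inode map
    segment_usage = {0: (4096, 0)}   # segment number -> (live bytes, last write time)

    # two checkpoint regions at fixed disk locations; LFS alternates between them
    # so that a crash during a checkpoint still leaves one valid region
    checkpoint_regions = [None, None]
    current_region = 0

    def flush_checkpoint():
        global current_region
        # 1. write the inode map and segment usage table into the log
        inode_map_addr = len(disk)
        disk.append(dict(inode_map))
        segment_usage_addr = len(disk)
        disk.append(dict(segment_usage))
        # 2. record their addresses plus a timestamp in the other fixed region
        current_region = 1 - current_region
        checkpoint_regions[current_region] = {
            "inode_map_addr": inode_map_addr,
            "segment_usage_addr": segment_usage_addr,
            "timestamp": time.time(),
        }

    flush_checkpoint()
    print(checkpoint_regions[current_region])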

Reading a File

Locate the inode of the file

Locate the current inode map by following the pointers in the checkpoint region

The inode map is usually found in memory because it is small and frequently accessed

Index into the inode map using the inode number

Locate the inode by following the pointer in the inode map

Reading the data blocks is the same as in FFS: follow the block pointers in the inode (see the sketch below)
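The read path in the same toy model (the block layout and names here are assumptions for illustration, not the paper's on-disk format):

    # toy "disk": list index = disk address, value = block contents
    disk = [
        b"hello",                                # 0: data block
        b"world",                                # 1: data block
        {"number": 7, "block_addrs": [0, 1]},    # 2: inode of file 7
        {7: 2},                                  # 3: inode map block
    ]
    checkpoint_region = {"inode_map_addr": 3}

    def read_file(inode_number):
        # 1. checkpoint region -> inode map (normally already cached in memory)
        inode_map = disk[checkpoint_region["inode_map_addr"]]
        # 2. inode map -> inode
        inode = disk[inode_map[inode_number]]
        # 3. inode -> data blocks, following block pointers just as FFS does
        return b"".join(disk[addr] for addr in inode["block_addrs"])

    print(read_file(7))   # b'helloworld'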

Recovery from Crash

Load the inode map and segment usage table from the latest checkpoint into memory

Roll forward: replay the segments in the log that were written after the checkpoint

For each inode found in a replayed segment, update the inode map

The segment usage table is updated as well

Stop when a segment with an older timestamp or the end of the segment chain is reached (see the sketch below)
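A sketch of the roll-forward pass, assuming each segment carries a summary with a write timestamp and the inodes it contains (toy structures again, not the paper's exact format):

    # segments in log order after the checkpoint; "inodes" maps inode number -> new disk address
    segments_after_checkpoint = [
        {"timestamp": 100, "inodes": {7: 200, 9: 201}},
        {"timestamp": 101, "inodes": {7: 210}},
        {"timestamp": 50,  "inodes": {3: 220}},   # older timestamp: left over from an earlier pass of the log
    ]

    def roll_forward(inode_map, checkpoint_time):
        last_time = checkpoint_time
        for seg in segments_after_checkpoint:
            # stop at an older segment: it marks the end of the valid log
            if seg["timestamp"] < last_time:
                break
            last_time = seg["timestamp"]
            # re-apply the inode locations recorded in the segment
            # (the segment usage table would be updated here as well)
            inode_map.update(seg["inodes"])
        return inode_map

    print(roll_forward({7: 42}, checkpoint_time=99))   # {7: 210, 9: 201}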

Cleaning Segments

Append-only logging has to wrap around the disk eventually, so free space must be reclaimed

Overwritten data and metadata blocks are dead

Blocks of deleted files are dead too

Cleaning segments to make room for further logging is done by the cleaner process

Procedure

Select segments to clean by consulting the segment usage table

The segment usage table records [# of live bytes, most recent modified time of any block] per segment

Liveness check

For each data block in the segment, check whether it is still live: the segment summary block records the owning inode number and the block's offset in the file, so look up that inode (via the inode map) and see whether it still points to this block

For each inode in the segment, check whether it is still live by comparing its version number against the one recorded in the inode map

Move live blocks to a new segment using the same procedure as normal log writing (sketched below); the old segment can then be reused
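A sketch of the cleaner's liveness checks, assuming each segment carries a summary block that records, for every data block, the owning inode number and the block's offset in the file (the summary-block idea is from the paper; the concrete structures below are invented):

    # current in-memory state: inode number -> current inode
    inode_map = {
        7: {"version": 2, "block_addrs": [100, 305]},
    }

    def is_data_block_live(summary_entry, block_addr):
        """summary_entry: {"inode": n, "offset": k}, recorded when the block was written."""
        inode = inode_map.get(summary_entry["inode"])
        if inode is None:
            return False   # the file has been deleted
        addrs = inode["block_addrs"]
        # live only if the file's current inode still points at this exact disk address
        return summary_entry["offset"] < len(addrs) and addrs[summary_entry["offset"]] == block_addr

    def is_inode_live(inode_number, version_in_segment):
        inode = inode_map.get(inode_number)
        # dead if the file is gone or the inode has since been rewritten with a newer version
        return inode is not None and inode["version"] == version_in_segment

    print(is_data_block_live({"inode": 7, "offset": 0}, 100))   # True: still referenced
    print(is_data_block_live({"inode": 7, "offset": 1}, 200))   # False: block was overwritten
    print(is_inode_live(7, 1))                                  # False: stale copy of the inode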