
The Design and Implementation of a Log-Structured File System

Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM Trans. on Computer Systems 10(1), February 1992, pp. 26-52.

Reviews due Thursday, 11/6


Summary: The Sprite Log-structured File System (LFS) is a file system that buffers writes in memory and then batches them into a single sequential write to disk. This reduces seek time for writes while maintaining comparable read times.

Problem: While reads and data writes can be asynchronous via the cache, metadata writes must be synchronous with the disk. Additionally, data tends to be spread across the disk, which forces seeks. These factors tie file system performance to that of the slow disk for workloads with many small files. Sprite LFS attacks this issue.

Contributions: Sprite LFS introduces a novel design built around a dynamic log, minimizing seek times and achieving close to 100% utilization of disk bandwidth. The log mechanism poses new technical issues, most importantly free space management (ensuring extent availability), which the authors solve with a two-level mechanism in which global decisions handle large segments. Beyond performance, Sprite LFS also handles error recovery well, using a simple checkpoint-based mechanism.

Techniques: Writes are buffered in the in-memory cache and then saved to disk in a single large disk write, where data and metadata are batched together into a segment. The basic structures (inodes, indirect blocks) are similar to Unix, but the layout differs: inodes are not at fixed positions; they travel with the data and are retrieved via an inode map that is small enough to fit in memory (so few extra reads are required) and is saved in checkpoint regions. For free space management, a two-level approach is used: segments at the high level and a segment cleaner at the low level. Segments are chained, but they are so large that the read/write time of a segment exceeds the seek time between segments, which diminishes the pointer-chasing overhead. Cleaning compacts live information within segments, reduces fragmentation, and produces free segments. File contents carry unique identifiers, UID = (version, inode number), that help track data liveness and clean obsolete data (treated as garbage once unreachable from the most current version).
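The inode-map indirection and UID-based liveness check described above can be sketched as follows. This is a minimal illustration; all names and structures here are hypothetical, not Sprite's actual code.

```python
# Sketch of LFS inode-map indirection (illustrative names, not Sprite's code).
# Inodes live in the log, so a small in-memory map records each inode's
# current disk address; a read costs one map lookup plus the same block
# traversal as in Unix.

class InodeMap:
    def __init__(self):
        self.addr = {}       # inode number -> disk address of latest inode
        self.version = {}    # inode number -> current version, for liveness

    def update(self, inum, disk_addr):
        """Called after an inode is appended to the log."""
        self.addr[inum] = disk_addr
        self.version[inum] = self.version.get(inum, 0)

    def delete(self, inum):
        """Bump the version so blocks tagged with the old UID become dead."""
        del self.addr[inum]
        self.version[inum] = self.version.get(inum, 0) + 1

    def is_live(self, uid):
        """UID = (version, inode number); stale versions are garbage."""
        version, inum = uid
        return inum in self.addr and self.version.get(inum) == version

imap = InodeMap()
imap.update(7, disk_addr=4096)
assert imap.is_live((0, 7))     # current version is reachable
imap.delete(7)
assert not imap.is_live((0, 7)) # old UID is now cleanable garbage
```

The cleaner can use exactly this check: a block whose UID no longer matches the map is dead and need not be copied.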

Weaknesses: The design relies on large write buffers, whose contents are vulnerable to system crashes. It requires a separate segment cleaner. Much of the benchmarking was done in simulation.

The Problem

Rosenblum et al. note that main memory and on-disk cache sizes are increasing while disk seek times are remaining roughly the same. As a result, they feel existing file systems are going to see improved read performance relative to their write performance as hardware continues to improve.


To address the trend above, the authors propose a file system that temporarily caches successive writes in memory, eventually writing them to disk together as a single log entry. This results in a smaller number of large writes, taking advantage of the large disk bandwidth while decreasing the amount of overhead involved in finding unused spots on disk to place data. Map structures are used to keep track of the physical location of the latest version of a file on disk. A segment cleaner periodically scans the file system log, physically grouping chunks of live data together so old data can be overwritten by new writes.
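The batching behavior described above can be sketched as a toy model (illustrative only; the class, names, and segment size are made up, and real LFS segments were about 1 MB):

```python
# Toy sketch of LFS-style write batching: dirty blocks accumulate in
# memory and are flushed as one large sequential segment write instead
# of many small seek-dominated writes.

SEGMENT_SIZE = 8  # blocks per segment (illustrative; Sprite used ~1 MB)

class LogBuffer:
    def __init__(self, disk):
        self.disk = disk       # append-only list standing in for the log
        self.pending = []      # buffered data and metadata blocks

    def write(self, block):
        self.pending.append(block)
        if len(self.pending) >= SEGMENT_SIZE:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One large sequential write replaces many small random ones.
        self.disk.append(list(self.pending))
        self.pending.clear()

disk = []
log = LogBuffer(disk)
for i in range(20):
    log.write(f"block-{i}")
log.flush()
assert len(disk) == 3           # 8 + 8 + 4 blocks, in three segment writes
```

The trade-off the next paragraph describes falls out directly: everything still sitting in `pending` at crash time is lost.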

Key Performance Trade-off

Extra memory is required to buffer writes, reducing disk write (and CPU) overhead. Data buffered in memory can now be lost in the event of a system crash. Crash recovery, however, is much faster, as the entire file system is a consistent log (with the exception of data from any interrupted writes).


The trends the authors note are oversimplified. Hard drive sizes tend to increase over time, and as they do, the average size of the files stored on them also tends to increase. Many of the tasks users care about therefore involve increasing amounts of data, so larger cache and memory sizes are needed just to retain comparable performance.

The log maintained by this file system defers extra bookkeeping in the hope that a segment cleaner will do it later. It's unclear to me that this delayed bookkeeping improves performance when a constant stream of write requests is generated by the system, resulting in a continuous need for bookkeeping.

The Design and Implementation of a Log-Structured File System

The paper describes the design, implementation, and performance of the log-structured file system, which maintains a log of writes to speed up write performance (in terms of both bandwidth and latency) and also enhances crash recovery.

Description of the problem being solved

The problem being solved is to build a file system that performs better on disk writes without losing read performance. The motivation is that synchronous writes take a long time, and seek latency most often dominates write costs.

Contributions of the paper

1) The log-structured file system replaces synchronous writes to random disk
offsets with asynchronous writes to the log, vastly improving performance
because it amortizes seek costs over large segments of buffered data.

2) Recovery is simplified compared to Berkeley FFS because recovery only has
to deal with the log, and therefore need not scan the whole disk to find
inconsistencies.

3) There are many small, innovative optimizations in the segment cleaner. For
example, a distinction is made between hot and cold regions so that hot
regions are cleaned more often than cold regions.

Flaws in the paper

1) Segment cleaning introduces concentrated disk activity, which poses a
problem when the disk workload is always high and there is not enough idle
time.

2) The synthetic workload does not do a good job in the performance
comparison with the traditional file system.

Techniques used to achieve performance

1) Batched writes to enhance performance for the common case of small writes
to random disk offsets.

2) Optimizations for more frequent use cases : Segment cleaning differentiates
between hot and cold regions.

Tradeoff made

1) Flexibility at the cost of later reorganization: LFS relocates modified
blocks to make write requests sequential. However, time must be spent
later to reclaim contiguous segments using the segment cleaner.

Another part of OS where this technique could be applied

1) The technique of batched writes for performance can be used in networked
file systems.

2) The technique of optimizing for frequent cases is used in many places.
Defragmentation in FAT file system is an example of this.


The log-structured file system is a new technique for managing disk I/O in which the disk is treated like a tape and all modifications are written sequentially. Having large extents of free space is crucial for fast writing, and the paper uses a cost-benefit policy to compact segments and ensure large unfragmented regions.

Problem Statement

The objective of the paper is to improve I/O performance substantially so that it catches up with exponentially increasing processor speeds and does not become a bottleneck for system scalability.


1. Changes to the filesystem are buffered in the file cache and all modifications are written to disk sequentially in a single write operation.

2. LFS makes all writes asynchronous, irrespective of whether they are inodes, data blocks, or directory entries. For small files, most of the overhead lies in writing these metadata entries, and writing them synchronously, as is done in Unix FFS, is a performance bottleneck.

3. Inodes are no longer in specific positions in the disk and hence an inode map holds the positions of inodes in the disk. Inode maps are small enough to be cached in main memory.

4. The disk is divided into large (1 MB) units called segments. A segment is written sequentially from beginning to end, so a segment must be entirely free before it is fit for writing. A segment cleaner program produces such segments by compacting the live information in them.

5. Which segment to compact? The segment with the highest benefit-to-cost ratio is chosen, where the benefit is the amount of free space that can be reclaimed, weighted by how long that space is likely to stay free.
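The benefit-to-cost ratio above is the paper's formula (1 - u) * age / (1 + u): reading the segment costs 1, rewriting the live fraction u costs u, and cleaning frees (1 - u) of a segment that stays free longer when the data is old. A minimal sketch, with made-up segment data:

```python
# Sketch of the cost-benefit cleaning policy from the paper: pick the
# segment maximizing  benefit/cost = (1 - u) * age / (1 + u),
# where u is the fraction of live data and age is the age of the
# youngest block in the segment.

def cost_benefit(utilization, age):
    return (1 - utilization) * age / (1 + utilization)

def pick_segment(segments):
    """segments: list of (name, utilization, age); returns the best one."""
    return max(segments, key=lambda s: cost_benefit(s[1], s[2]))

segments = [
    ("hot-full",  0.90, 5),    # mostly live, recently written
    ("cold-full", 0.75, 100),  # mostly live, but stable for a long time
    ("hot-empty", 0.20, 5),    # mostly garbage already
]
best = pick_segment(segments)
assert best[0] == "cold-full"  # cold segments are worth cleaning earlier
```

This captures the paper's key insight: a fairly full but cold segment can be a better cleaning target than an emptier hot one, because its freed space stays free.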


1. When little idle time is available, it might be very costly to use that time for segment cleaning.

2. In the worst case, the data of a file could be distributed almost randomly in LFS, and hence the seek time for sequential access will be very high. Compare this with FFS, which maintained some clustering of file data and prevented it from becoming totally random.

3. When disk utilization is very high, the write cost increases very rapidly as against the stable write costs in the case of FFS.

Basically, LFS outperforms FFS in the common case of many small, non-random writes, low disk utilization, and so on, but it performs badly in the worst case.
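The rapid increase in write cost at high utilization follows directly from the paper's write-cost model, sketched below as a toy calculation (not Sprite code):

```python
# Sketch of the paper's write-cost formula: write_cost = 2 / (1 - u),
# where u is the fraction of live data in the segments being cleaned.
# To free (1 - u) of a segment, the cleaner reads the whole segment (1)
# and writes back the live fraction (u); new data then fills (1 - u),
# so total I/O is (1 + u + (1 - u)) = 2 per (1 - u) of new data.

def write_cost(u):
    if u == 0:
        return 1.0            # a fully dead segment need not be read at all
    assert 0 < u < 1
    return 2 / (1 - u)

# Cost grows slowly at first, then explodes near full utilization:
for u in (0.0, 0.5, 0.8, 0.9):
    print(f"u={u:.1f}  write cost={write_cost(u):.1f}")
```

At u = 0.5 the cost is 4, at u = 0.8 it is 10, and at u = 0.9 it is 20, which is exactly the "very rapid" increase the reviewer notes.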


1. Low disk utilization gives good performance in LFS, but then the cost per usable byte is high, and vice versa.

2. The paper focuses on making writes very fast at the cost of reads becoming slow in the worst case.

Techniques used

1. Make the common case fast - Common across all systems. Example - Processors support instructions with constant operands because such instructions occur frequently.

2. Logging for performance and crash recovery - Widely used in databases

This paper presents LFS, a completely different way of maintaining a file system, as a sequential log, for the Sprite OS. The authors show that it outperforms UNIX FFS (which uses only 5-10% of disk bandwidth for small-file writes) by an order of magnitude on small-file writes, and matches read and large-write performance. LFS is shown to achieve nearly 70% of disk bandwidth for writing, even including cleaning overheads.

Prior to LFS, traditional file systems suffered from two major problems. First, files are laid out sequentially but different files are physically separated (there is logical locality, but not temporal locality), and metadata for a file is separated from its contents; this results in high seek times, which waste disk bandwidth (especially for small files). Secondly, writes are all synchronous, which couples application performance to disk bandwidth, which cannot scale with faster CPUs. This paper presents LFS, which solves these problems.

Broadly, there are four major contributions. First, LFS speeds up file writing by writing sequentially to the disk, using indexing information to read back from the log efficiently. Secondly, an efficient cost-benefit cleaning algorithm helps maintain large free spaces on disk for fast writing. Thirdly, LFS provides fast and efficient crash recovery by checkpointing and rolling forward in the log; the associated overheads can be controlled by tuning the interval at which checkpoints are taken. Finally, asynchronous (batched) writes decouple application performance from disk bandwidth constraints: data can be written to the disk in a single large I/O that consumes all of the disk bandwidth.

The micro-benchmarks used for evaluation are overly optimistic. Low disk utilization (17%) means there are no cleaning overheads, so it is not an apples-to-apples comparison with UNIX FFS. For real applications, all the measurements were started "several months" after putting the file systems into use. Why so? Start-up effects may be revealing: swap-file workloads may still be relevant during start-up, when disk utilization of user directories may not yet be very high.

Techniques used:
Batching write requests via asynchronous writes helps increase disk bandwidth utilization. Age-based cleaning exploits temporal locality of data (similar to generational garbage collectors). Cost-benefit segment cleaning segregates cold data and differentiates it from hot, rapidly changing data.

A cost-performance trade-off is made for evaluation purposes. If disk space is underutilized, high performance can be achieved. But if disk utilization is increased, cleaning overheads go up thus trading off with performance.

Alternative uses:
Group-commits in database systems, write-ahead logging for crash-recovery, versioning systems such as cvs and svn (maintaining diffs instead of logs). LFS is gaining importance for flash disks, but there are still many unexplored questions before we can exploit its full potential.

The paper presents log structured filesystems; specifically the authors' implementation called Sprite LFS. Log structure here is a way of organizing the filesystem as a log of changes rather than the traditional layout of spatially arranged blocks. This organization is shown to have advantages of faster write speeds and better crash recovery.

Problem Summary
With CPU speeds and memory sizes on the rise, file systems no longer need to be stingy with these resources. The authors recognize that more and more read accesses can be served from file pages cached in memory, and hence they predict that future file systems will see writes as the major component of the workload. Likewise, while disk capacities have been rising, the mechanical aspects of the disk have not changed much, and hence seek times have more or less remained the same. This motivates a radical reorganization of the file system that avoids seeks as far as possible.

- LFS organizes the filesystem as a log of changes. This allows sequentially writing all the changes one after the other to the disk thereby increasing the write throughput tremendously.
- LFS compares the traditional concept of logical locality against the idea of temporal locality. The basic idea is that data generated with temporal proximity is likely to be used with temporal proximity. With this in mind, it makes sense to store recently modified/created data together, and the observation that Sprite LFS's read speeds are comparable to traditional filesystems sort of justifies this argument.
- Crash recovery is made simpler by the use of the log structure and consistency checkpoints. With these, the amount of data that needs to be checked for consistency upon crash becomes a configurable parameter tunable by the system administrator. This is a huge saving in time compared with the traditional handling of checking the whole disk.
- The roll-forward mechanism proposed with consistency checking allows for recovery of an even larger amount of data in the event of crash, thereby making the system really robust.

- The log structure makes a segment cleaner (an incarnation of a garbage collector) necessary for the filesystem. This is a tradeoff / overhead here.
- The authors' claim that reads can be serviced from cache most of the time doesn't always hold. For one, increasing CPU capabilities have also resulted in an increase in the working set sizes of applications. Also, many files, such as multimedia files, are read only once, sequentially.
- Overwriting content already in a file seems expensive, as the inode, inode map, and indirect blocks (if any) also need to be rewritten.

- The techniques used here are batching (with the help of buffering) and temporal locality. Dumping data in bulk into log segments, after buffering it for a while in memory, feeds a huge amount of data to the disk to be written to sequential blocks.

- The tradeoff involved here is the necessity for removing stale data through the use of a garbage collector and the overhead of maintaining more indexes.

Alternate uses
- I guess a similar technique is used in multisession CD-ROMs, although the complexity of a segment cleaner is not present there.


In this paper the authors present Sprite LFS, a log-structured file system. Sprite LFS speeds up file writing and crash recovery by writing all modifications to disk sequentially in a log structure. The authors describe the basic features of the file system (the log, segment cleaning mechanisms and policies, and the crash recovery mechanism) and then present several experimental results.

The fact that CPU speed has increased dramatically, whereas disk access times have only improved slowly, is the reason why more applications become disk-bound. The existing file systems couldn’t cope with the new technology, so the system frequently becomes unbalanced. The authors try to resolve this problem by designing a log-structured file system which will use disks more efficiently than the existing file systems.

The main contribution of the file system is that it improves writes. More specifically, it buffers modified blocks and then transfers them to disk with only one seek. This can have a great impact on performance as file caches become larger. Writes are now asynchronous, in contrast with FFS, where all metadata structures are written synchronously. Another contribution is the cleaning mechanism used for the management of free space. The goal is to maintain large free extents for writing new data, and a combination of copying and threading is used to achieve that. Moreover, a segment cleaning policy based on identifying hot and cold data and cost-benefit analysis is used. Finally, another contribution is the crash recovery mechanism. In traditional file systems without logs, the system has to scan all the metadata structures on disk to restore consistency after a crash. Sprite LFS uses only the most recent checkpoint.

One of the flaws of the system is that it is based on the assumption that reads don't affect its performance. Although caches have become larger, I think we can't ignore reads based only on that fact. As far as the experiments are concerned, I believe they should use real workloads instead of microbenchmarks. For example, when they compare Sprite LFS and FFS, files were read in the same order as created and no cleaning was needed. This example is not realistic.

The main techniques used to achieve greater performance are the buffering of writes, so that large segments of data can be transferred with one seek, the use of temporal locality, and the use of checkpoints and a roll-forward operation to restore consistency after a crash. One tradeoff is that the cleaning mechanism may add overhead. Although buffering significantly improves write performance, it has the disadvantage of increasing the amount of data lost during a system crash. Finally, there is a tradeoff between writing and reading: for example, there is additional overhead when performing sequential reads after random writes.

The paper describes log-structured file systems, which write all updates to disk sequentially in a log. This provides faster crash recovery compared to the existing Fast File System.

Existing file systems perform synchronous writes. FFS does not batch writes/updates, and as workloads include lots of small files and small updates, application performance is coupled with disk performance. Since disk performance is not improving in line with processors and memory, applications do not benefit from faster memory and processors.

Contribution/Work Summary:
1. LFS does not keep any free list or bitmap data structure to track free space. LFS uses FFS's inode structure, but inodes are not located at fixed positions; an inode map is used to locate the latest version of a file's inode. The inode map itself is located at different places on the disk, but its latest version is loaded into memory for fast access.
2. Instead of tracking free blocks, LFS keeps appending updated file/directory inodes and data blocks to the end of the log. The log is circular, so LFS needs to reuse space.
3. LFS uses a combination of threading and copying. The disk is divided into segments; the log is threaded from segment to segment, while live data is copied out of segments being cleaned. Several segment cleaning policies are proposed in the paper, and the cost-benefit cleaning policy performs best.
4. Crash recovery is many times faster in LFS, as it need not search through all the file system metadata: by starting from the last checkpoint in the log and rolling forward, it can perform recovery.

1. LFS is built on the assumption that reads can be served from memory. This assumption holds if the same files are read repeatedly and memory is able to cache them. If an application reads multiple files that do not all fit in memory, LFS will perform poorly, as reads will go to disk. Applications like databases are read-intensive, and all their data can't fit in memory.
2. If a file is written randomly and read sequentially, LFS will perform poorly, as it needs to seek, while in FFS even random writes place a file's data blocks together, so a sequential scan does not require seeks.

1. To improve write performance, read performance is traded off.

Another part of System where technique can be applied:
1. Database recovery techniques are similar to LFS: there too, the log is written to disk before the actual modified data, and using the log the database is able to recover.
2. The hot-segment technique could be used in memory management. If some pages are accessed more frequently than others, those pages can be marked hot, and when choosing pages to evict, hot pages can be given lower priority. For example, in a nested loop the data accessed in the inner loop is touched most frequently, so that data can be marked as the hot set.
3. Journaling file systems write changes to a circular log before committing them to the main file system, so this technique is used alongside existing file systems.

This paper introduces a full-fledged log-structured file system, Sprite LFS, that writes all updates sequentially to disk. This increases write throughput but requires a segment cleaner to maintain the usability of the disk. For efficient retrieval, it maintains additional summary information and a reverse index with each block.

As cache sizes increase, most of the file blocks that are accessed will already be in the cache. As a result, the majority of disk I/O will be writes, whereas the existing Unix file system is optimized for reads. It has synchronous writes (at least for metadata) and the additional overhead of finding the most appropriate block for each write.

- By getting rid of any layout policy and writing in terms of segments, the write speeds have been improved in the LFS system.
- Even so, this doesn't affect read speeds, as most of the time the blocks that are written together are the ones that are accessed together. So all the effort to ensure locality can be approximately matched by just making sure that all writes to a file are localized.
- Segment cleaner has various useful techniques for efficient usage of the disk. The differentiation between hot and cold regions helps to make sure that hot segments are cleaned more often than cold segments yet cold regions are not completely ignored.
- By the virtue of the design of the LFS, crash recovery using check-pointing becomes comparatively easier as it just involves dumping out the modified tables.

- Any kind of garbage collector has to run as a separate process and tends to use up resources, especially when the system does a lot of writes and deletes. It also makes a segment unusable while compaction takes place.
- The segment cleaner, instead of invalidating blocks greedily, checks the inode entries to determine whether a location is still valid. If an entry is in the third indirect block, it requires four page reads before you can tell whether the entry is stale. This check is done for all the blocks in the segment, which also leads to cache pollution.

- Instead of wasting cpu cycles and write times on achieving logical locality, they concentrate more on temporal locality by using a greedy approach and show that it performs almost as good as the ideal solution.
- I would assume this system would actually give even better performance in solid state devices.

The paper presents a log-structured file system (LFS) that uses the file cache to buffer a sequence of file system changes and writes them to disk sequentially in a single disk write operation. As opposed to traditional file systems, LFS writes data in a log-style fashion, appending data and metadata (inodes, directory information, etc.) at the end of the log. The log is the only structure on the disk, and all data resides in the log. LFS thereby improves disk bandwidth utilization for writing.
The problem the paper addresses:
Current file systems suffer from (1) lots of little writes and (2) synchrony: they wait for the disk in too many places, which makes it hard to benefit much from RAIDs because there is too little concurrency. The goal of LFS is to achieve efficient small-file writes while matching FFS on reads and large-file writes. It improves write performance by buffering a sequence of file system changes and writing them to disk sequentially in a single disk write operation.
1. The log-structured file system converts many small synchronous random writes into large asynchronous sequential transfers, so that nearly 100% of the raw disk bandwidth can be utilized.
2. A traditional Unix FS must scan all metadata to restore consistency after a reboot, which takes a long time as storage sizes increase. In LFS it is easy to locate the most recent disk operations: LFS uses checkpoints, positions in the log at which all file system structures are consistent and complete, and uses roll-forward to recover information written since the last checkpoint.
3. If only checkpoints were used to recover after a crash, the changes made after the last checkpoint would be lost. In order to recover as much information as possible, a roll-forward mechanism is used. As a result, most of the useful work done after the checkpoint is recovered, and the resources spent on that work are not wasted.
The first flaw I want to mention is that LFS assumes files are written in their entirety; otherwise LFS would suffer intra-file fragmentation. LFS claims better performance, but how would LFS compare to UNIX if small files "get bigger"? What is more, for LFS the disk is always full, since the log occupies the whole disk. There are some other potential problems, including log retrieval on cache misses and wrap-around: what happens when the end of the disk is reached?
The techniques used to achieve performance:
The main techniques used in this paper include free-space management, segment cleaning, checkpoints, and roll-forward. For free space management, LFS divides the disk into fixed-length segments: (1) each segment is written sequentially; (2) only empty segments can be written; (3) segments are large enough to make seek time negligible; (4) older segments become fragmented over time. For crash recovery, LFS reads the checkpoint and rolls forward through the log from the checkpoint state. Segment cleaning is the process of copying live data out of segments: the cleaner reads a number of segments into memory, identifies the live data, and writes only the live data back into a smaller number of clean segments. There are two steps to creating a checkpoint: (1) write out all modified information to the log; (2) write the checkpoint region.
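The checkpoint-plus-roll-forward recovery scheme described above can be sketched as a toy key-value log. This is purely illustrative (the record format and helper names are made up, not Sprite's on-disk structures): recovery restores the last checkpoint and then replays only the log records written after it, instead of scanning the whole disk as fsck would.

```python
# Toy sketch of checkpoint + roll-forward recovery (illustrative only).
log = []                                 # append-only log of (key, value) ops
checkpoint = {"state": {}, "upto": 0}    # snapshot + position in the log

def apply(state, op):
    key, value = op
    state[key] = value

def write_checkpoint(state):
    # Step (1): all modified info is already in the log;
    # step (2): write the checkpoint region (snapshot + log position).
    checkpoint["state"] = dict(state)
    checkpoint["upto"] = len(log)

def recover():
    state = dict(checkpoint["state"])    # 1. load the last checkpoint
    for op in log[checkpoint["upto"]:]:  # 2. roll forward through the log
        apply(state, op)
    return state

state = {}
for op in [("a", 1), ("b", 2)]:
    log.append(op); apply(state, op)
write_checkpoint(state)
for op in [("a", 3)]:                    # work done after the checkpoint
    log.append(op); apply(state, op)
assert recover() == {"a": 3, "b": 2}     # post-checkpoint work is recovered
```

Recovery time here is bounded by the amount of log written since the last checkpoint, which is exactly why the checkpoint interval is a tunable knob.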
(1) There is no segment cleaner in the traditional UNIX file system, while LFS implements one at the cost of additional complexity.
(2) LFS gives up update-in-place and writes new copies of updates together. Its advantage is that writing is fast (the main problem of FFS is solved), but it also has disadvantages: complexity in reading (though the cache relieves this) and segment cleaning overhead.
(3) In file systems such as FFS (Fast File System), Ext3, FAT, and NTFS, the locations of files and metadata are fixed, and when data is modified it is overwritten in its original location. LFS avoids this by relocating every modified block in order to sequentialize write requests. LFS, however, has the disadvantage that it must reclaim contiguous disk space (segments) for further writes, and the efficiency of this reclamation (called cleaning in much of the literature) determines the performance of LFS. It has been shown that, due to cleaning overhead, the performance of LFS degrades when file system utilization rises above a certain point.
Another part of the OS where the technique could be applied:
The checkpoint technique used in this paper can also be used in other parts of the OS. For example, to improve computer reliability, several checkpoint-based fault-tolerant multiprocessors have been proposed in past years; these approaches deal with transient and permanent errors in general-purpose computing.

This paper introduces a radically different technique for organizing data on secondary storage: the log-structured file system. The authors describe how earlier file systems fail to utilize the disk bandwidth and propose an alternate organization that reduces time spent seeking. They describe the implementation of the Sprite file system and evaluate it on simulated workloads. Crash recovery mechanisms are also touched upon.

What were they trying to solve:
While CPU speed, memory size, disk capacity, and bandwidth have grown rapidly, disk seek times have remained fairly static, since they involve a mechanical component. Increased memory in turn increases the cache available, so the bottleneck shifted from reads to writes. Existing file systems tended to spread data around the disk, incurring too many small seeks. Also, a large percentage of writes were synchronous, which meant that applications were I/O-limited and thus unable to take advantage of faster CPUs.

The authors concentrated on improving the write case. All writes are asynchronous, and multiple changes to different blocks would be written sequentially without incurring any seeks. The writes are not in-place, so there can be multiple versions of the data on the disk. Every time data is written, the associated metadata is also written so that the new version will be found the next time the data is read.
Unlike FFS, inodes are not found at static locations, but are spread across, therefore an inode map is required to keep track of location of current version of inode.
To manage free space, the disk is divided into fixed-size segments. The Sprite file system uses a combination of copying data (from a segment to the head of the log) and threading (to avoid unnecessary copying of segments when the segment data is long-lived).
When the number of clean segments grows low, a segment cleaner runs and chooses segments to clean, using a metric called write cost to decide which segments to process. File access patterns are also used to implement a cost-benefit policy, which differentiates the cleaning treatment of "hot" and "cold" data.
Crashes are recovered from using a combination of checkpoints, which are snapshots of the system in a stable state, and roll-forward, which recovers as many changes after the checkpoint as possible.

The assumption that I/O performance will be write-limited isn't necessarily true. While memory cache size has grown, working set size and the disk capacities have grown as well, and read performance is still a significant factor.
Most modern file systems try to reduce seek costs by allocating contiguous blocks to a file as much as possible, so the argument that Sprite FS uses the same number of accesses as FFS after reaching the inode isn't necessarily true.
Every data write incurs multiple metadata writes: the inode needs to be written since the block address has changed, the inode map needs to be written since the inode location has changed and so on.

Techniques used to improve performance
The central idea here is to speed up writes (and to some extent reads) at the cost of an offline overhead of cleaning.
Sprite FS trades spatial locality (consecutive blocks of a file are accessed together) for temporal locality (blocks that were written consecutively will be accessed together). It also trades static structure (inodes at specific locations) for flexibility (inodes at arbitrary locations).
Another part of the OS where the technique could be applied:
Swapping in Virtual memory has to make some of the same sort of allocation decisions, which might benefit more from such fast writes.

This paper presents a new technique of file storage management in which all writes are sent to disk sequentially in a log-like structure, speeding up both file writes and crash recovery. LFS is shown to perform an order of magnitude better than the UNIX Fast File System for writing.

CPU speed is increasing exponentially, but there has been no major improvement in disk access time. The Fast File System spreads information around the disk in a way that causes too many small seeks: it takes at least five disk I/Os, each preceded by a seek, to create a file. Main memory is becoming cheaper, and file reads already use caching to achieve higher performance, but the Unix Fast File System was still writing several small blocks synchronously. For workloads with many small files, disk traffic is dominated by these synchronous writes. They make it hard for applications to take advantage of faster CPU speeds and defeat the potential of using a big file cache as a write buffer.

• Buffering a sequence of file system changes in the file cache and then writing all changes to disk sequentially in a single disk write. Storing both data and metadata together results in fewer seeks during file access, and hence higher performance. The log structure allows batching several writes together, and placing inodes in the log improves performance. Eliminating the bitmap and free list saves disk and memory space and simplifies crash recovery.
• A segment cleaning policy based on identifying hot and cold data and on cost-benefit analysis.
• Crash recovery with checkpoints, which define consistent states of the file system.
• Roll-forward, a recovery mechanism that starts from the last checkpoint and scans the log to recover as much subsequent state as possible.
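The first bullet, buffering changes and flushing them as one sequential write, can be sketched with hypothetical structures (a toy in-memory model, not the paper's implementation):

```python
# Sketch of LFS-style write batching: dirty blocks (data and metadata
# alike) accumulate in a memory buffer and are flushed to the log tail
# as a single segment-sized sequential write.

SEGMENT_SIZE = 8  # blocks per segment (tiny, for illustration)

class LogFS:
    def __init__(self):
        self.log = []       # the on-disk log, modeled as a list of segments
        self.buffer = []    # (kind, payload) pairs awaiting the next flush

    def write_block(self, kind, payload):
        self.buffer.append((kind, payload))
        if len(self.buffer) >= SEGMENT_SIZE:
            self.flush()

    def flush(self):
        # Data and metadata go out together as one sequential segment write,
        # instead of one small seek-plus-write per block.
        self.log.append(list(self.buffer))
        self.buffer.clear()

fs = LogFS()
for i in range(8):
    fs.write_block("data", i)
assert len(fs.log) == 1          # eight small writes became one segment write
```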

LFS optimizes the common cases, but worst-case performance could suffer considerably.
Segment cleaning can increase disk activity in a highly volatile system; the paper should probably have included experiments on such a system and on a highly utilized file system. When all segments are almost constantly highly utilized, near-useless cleaning could consume significant CPU time. The paper says little about this impact of segment cleaning, and the write-cost metric does not account for updating the segment summary.

Write performance is improved by using a log: data and metadata are localized in a compact space, reducing seek time, and several writes are buffered together (into segments) in the file cache, reducing disk traffic. The file system is made robust by the addition of the checkpoint data structure.

• A small amount of performance is traded for crash recovery: 13% of the information written to the log consists of inodes, inode map blocks, and segment map blocks, all of which tend to be overwritten quickly. This is a consequence of the short checkpoint interval, which forces metadata out to disk frequently.
• Writing is fast but comes with added overhead in reading (sequential reads after random writes lead to many seeks, while sequential reads in UFS require fewer seeks) and in segment cleaning.

Another area for similar technique:
The same technique can be applied to other I/O devices: for example, print jobs could be batched together and then sent to the printer driver instead of interrupting the printer frequently, reducing I/O bus traffic. The crash-recovery model can be applied to large databases and to any application holding configuration information, which can be checkpointed to a file at regular intervals.

The paper “The Design and Implementation of a Log-Structured File System” discusses a file system that writes buffered sets of data to the disk sequentially, in a structure known as a log, to maximize disk bandwidth. To achieve this, LFS splits the disk into segments and runs a segment cleaner to keep segments free for incoming log writes.

Problem to Solve
As CPU performance and memory sizes increase exponentially, while disk improvements focus more on capacity than performance, the disk, rather than the CPU, will increasingly limit overall performance. Engineering and office application workloads tend to involve large numbers of small file accesses, resulting in at least five separate disk operations and seeks to create a file. Sprite LFS is trying to improve this performance.

- Migrating the idea of log-based writing from write-once media to a hard disk
- Analysis of a number of segment cleaning mechanisms to determine an optimal algorithm
- Futuristic thinking in finding ways to make the system perform better as technology progresses. For example, Sprite LFS will improve performance by a factor of 4-6 as processors get faster.
- Developing a file system that integrates a level of crash recovery.

The Log-Structured File System depends on the disk being underutilized in order to maximize disk bandwidth and minimize segment-cleaning overhead. However, from personal engineering experience, the quantity of data increases quickly until the disk reaches capacity and then stays close to fully allocated. If this behavior were the norm, Sprite LFS would not outperform FFS in such an engineering environment.

The techniques presented in this paper are buffered writes, an additional layer of indirection, dynamic allocation of inodes, and temporal locality. Buffering allows file system changes to accumulate in main memory so that a large block of changes can be written together; the inodes and data are all written together, with structures such as the segment summary block helping to maintain file system functionality. The trade-off is increased buffering for increased performance: buffering adds more risk from system crashes, although the implementation can restore a clean state more easily, and changes made just prior to a crash may still be lost. The gain in disk throughput is important because seek time is the bottleneck. The technique of packaging a group of messages together could also be valuable for passing information between user and kernel level, and for inter-process communication, to reduce the overhead of sending many small messages.

This paper describes a log-structured file system that writes modifications to disk sequentially.

Processor speed and memory capacity have been increasing at an exponential rate, but disk access time is fairly limited by mechanics. As the cost of reads can be minimized using larger file caches, how can a file system be structured so as to minimize the cost of writes, specifically many small writes, while maintaining current performance for reads?

The authors implement a log-structured file system. Many small writes, that would be handled synchronously in traditional file systems, are buffered so that they may be handled asynchronously in larger transfers. The file system also outputs the structures that would be used by FFS for locating and reading files, so as to maintain read performance.

For the sake of avoiding fragmentation and increasing performance, the authors attempt to maintain large extents of free space. Space is divided into large fixed-size extents called segments. In order to maintain large extents of free space, live file data can be moved (“cleaned”) from segments, and a good amount of attention is paid to cleaning policy. Specifically, by examining simulations of “hot” and “cold” segments, it is determined that there is longer-lasting benefit from cleaning segments with “cold” less-accessed files.

Checkpoints are used to maintain consistency. After a crash, recovery begins at the latest checkpoint, where consistency is guaranteed, and “rolls forward,” recovering as much information as possible.

Although it was not necessarily a goal of theirs, little consideration is given to other storage technology where the premises that seeks and writes are comparatively expensive may not apply.

There are a number of tradeoffs here that, while not unique to this file system, are prominent: between performance and space utilization, between performance and CPU/memory utilization, and between read and write performance. The authors also compromise between threading and copying data, recognizing that too much threading can cause severe fragmentation but that extra copying is expensive. An important, readily applicable idea from this paper is that considering the physical characteristics of hardware can lead to better design choices.


This paper introduces Sprite LFS, a Log-Structured File System. This FS improves write performance for workloads of many small files. The most challenging part of this type of FS is maintaining large free areas on the disk. The paper examines different cleaning policies and implements one that has low overhead even for long-lived files: the cost-benefit cleaning policy. The system is compared against the contemporary Unix FFS.

There are three main components in a system: CPU, main memory, and disk. In recent years CPUs have gained considerably higher speeds and memories considerably larger sizes, while disk access speeds are difficult to increase. To take full advantage of CPU speedups, disk accesses need to improve. A bigger memory allows more files to be cached, so the problem becomes how to improve writes to the disk. A log-structured file system writes all modifications to disk sequentially, in a single I/O operation.

This is the first time a log is used as the only structure for writing data in a file system. All data is written to disk sequentially, using single large disk operations, which makes writes very fast by exploiting temporal locality.

They propose a simple cleaning policy based on the benefit-cost ratio of cleaning each segment, instead of considering only utilization. The benefit combines the space that will be recovered (low utilization) with an estimate of how long that space will stay free (the age of the data). This way they clean segments with low utilization, but also some with higher utilization that are unlikely to release free space soon. This is how they obtain the bimodal segment distribution considered optimal.
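The benefit-cost ratio described above can be sketched as a small formula (a sketch of the paper's policy, hypothetical function name): cleaning a segment costs reading it (1) plus writing back its live data (u), and the benefit is the freed space (1 - u) weighted by the age of its data.

```python
# Cost-benefit ranking for segment cleaning: pick the segment with the
# highest (free space generated * age of data) / cost, where cost = 1 + u.

def benefit_to_cost(u, age):
    # u: fraction of the segment still live; age: age of its youngest block
    return (1.0 - u) * age / (1.0 + u)

# A fairly full but cold segment can outrank an emptier but hot one:
assert benefit_to_cost(0.75, 100) > benefit_to_cost(0.30, 2)
```

This is exactly the effect the review describes: cold segments are cleaned at higher utilizations than hot ones, producing the bimodal distribution.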

Recovery is easier and faster for a log-structured file system. They propose checkpoints and roll-forward: checkpoints record the known state of the system at a given point in time, and roll-forward tries to recover the data modified between the last checkpoint and the crash. Recovery only needs to examine the most recently written part of the log.
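The checkpoint-plus-roll-forward idea can be sketched with toy structures (hypothetical names; the real system checkpoints inode map and segment usage blocks, not a Python dict):

```python
# Toy sketch of crash recovery: a checkpoint snapshots the inode map;
# roll-forward replays only the log records written after the checkpoint
# position, instead of scanning the whole disk.

def recover(checkpoint_imap, log, checkpoint_pos):
    imap = dict(checkpoint_imap)            # start from the stable snapshot
    for inum, addr in log[checkpoint_pos:]: # scan only the log tail
        imap[inum] = addr                   # roll each update forward
    return imap

log = [(1, 100), (2, 200), (1, 300)]    # (inode number, new disk address)
ckpt = {1: 100, 2: 200}                 # checkpoint taken after two records
assert recover(ckpt, log, 2) == {1: 300, 2: 200}
```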

One of the main advantages of FFS is its two block sizes, which reduce wasted space. LFS uses a block size of 4 KB and many more data structures to keep the state of the FS. How does this affect wasted space in the system? Is it more than the 10% free space that the layout policies needed in FFS? The experiments assume an average disk utilization of around 75%; is this optimal?

For the measurements they choose a segment size of 1 MB. How was this number chosen? Is it the optimal size? How does it affect overhead? A directory operation log is used to store changes in directory-file links before any changes are written to disk. How is this space managed? How big is it?

The system is compared to Unix FFS using a synthetic benchmark that creates 10000 one-kilobyte files and reads them back in the same order as they were created. They claim a 10x speedup over FFS. This benchmark is unrealistic and favors LFS over FFS: all the files are created first and then read in creation order, and the workload never forces LFS to clean, so the most important cost of the system, the cleaning overhead, is not measured.
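For reference, the small-file benchmark is easy to reproduce in spirit; the following is a scaled-down sketch (reduced file count and a temporary directory, not the paper's setup) that could be wrapped in timing code to compare file systems:

```python
# Scaled-down sketch of the paper's small-file benchmark: create n files
# of `size` bytes each, then read them back in creation order. The paper
# used 10000 one-kilobyte files; this sketch defaults to 100 for speed.
import os
import tempfile

def small_file_benchmark(n_files=100, size=1024):
    with tempfile.TemporaryDirectory() as d:
        data = b"x" * size
        for i in range(n_files):                       # create phase
            with open(os.path.join(d, f"f{i}"), "wb") as f:
                f.write(data)
        total = 0
        for i in range(n_files):                       # read-back phase
            with open(os.path.join(d, f"f{i}"), "rb") as f:
                total += len(f.read())
        return total

assert small_file_benchmark() == 100 * 1024
```

Note that, as the review points out, a run like this never fills the disk and so never exercises the cleaner.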

Performance is obtained from the log structure, which writes sequentially to the disk; a low-overhead cleaning policy ensures high performance is maintained, and checkpoints improve performance during crash recovery.
They choose a more complicated data-structure design in order to speed up writes, along with mechanisms to reclaim large free areas on disk: they trade complexity for speed. Temporal locality is a technique used in many parts of the OS, such as page-replacement policies and caches.

Summary: The authors of this paper present the Log-structured file system as an alternative to the Unix FFS. Using slightly different structures, the authors achieve quicker writes and faster recovery from crashes while minimizing used disk space and utilizing up to 70% of disk bandwidth.
Problem to Solve: The problem with FFS is that writes are slow because they are done synchronously, so the system must wait. The authors believe these can be done more quickly asynchronously, and since files are typically small, there is good reason to optimize for smaller files. As CPU speeds increased, the slow writes hindered overall system performance.
Contributions: One contribution is the introduction of a log as the file system structure itself. This allows buffering file system changes until a full chunk can be written at once. The log structure allows quick writes and easy random-access retrievals with the help of inodes. Another contribution is the observation that many reads come from the cache while writes typically involve small files, so optimizing for this case increases performance drastically. Since writes take the most time, they are minimized by writing everything as it arrives, while inodes still allow comparable read performance when necessary. Even when large files must be read, the seek time is amortized as the rest of the file is read in, and inodes still allow quick location of the remaining blocks. Another important contribution is the introduction of checkpoints and roll-forward. Although databases sometimes have these features, they were essential here to allow quick and easy crash recovery; they also allow the directory operation log to ensure that directories and inodes reflect the state they should. Lastly, they introduce segment cleaners, an interesting way to defragment disk space. One nice property is that they can even move live data, and a built-in mechanism is presented to try to move the 'coldest' data first.
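The cleaner's ability to distinguish live from dead data can be sketched with the version-number idea from the paper (hypothetical structures; the real cleaner consults segment summary blocks and the current inode):

```python
# Simplified liveness check: each block recorded in a segment summary
# carries a uid of (inode number, version). A block is treated as live
# only if that version still matches the file's current version; blocks
# of deleted or rewritten files fail the check and can be discarded.

def live_blocks(segment_summary, current_version):
    return [blk for blk, (inum, ver) in segment_summary
            if current_version.get(inum) == ver]

summary = [("b0", (1, 1)), ("b1", (2, 1)), ("b2", (1, 2))]
versions = {1: 2, 2: 1}           # file 1 was rewritten (version bumped)
assert live_blocks(summary, versions) == ["b1", "b2"]
```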
Flaws: One flaw of the paper is the use of micro-benchmarks to test system performance. Although this gives results closer to a real system implementation, it is still not completely realistic: all files created were 1KB, less than the 4KB block size, and they didn't write to the disk enough for the segment cleaner to ever run. Another problem is that the roll-forward mechanism was never implemented in the production system, yet the paper presents system recovery times in table 3. Another problem is with Figure 8, which doesn't include time for any segment cleaning; it is reasonable to present best-case performance, but segment cleaning is still necessary even in situations where the user would otherwise see the best case. Finally, when disk utilization gets very high the system becomes inefficient, because it constantly looks for segments to clean when there is little room to move data around.
What tradeoff is made: One tradeoff is that performance is very high at low utilization, but the cost per byte to achieve it is high. Another is that the authors assume most reads are served from the cache, so reads from disk take roughly the same time as in other file systems, in exchange for faster writes. Also, the log system can pack data on disk more compactly, but at the cost that a file can become split across multiple parts of the log depending on when it is extended.
Another part of the OS where the technique could be applied: One place checkpoints are used is in version control software such as CVS. CVS keeps a log of the most recent changes and allows users to roll back through this log. CVS, though, has features beyond this log structure, such as the ability to revert to any earlier checkpoint.

In “The Design and Implementation of a Log-Structured File System,” Mendel Rosenblum and John K. Ousterhout describe a new way of storing data on disk that shifts away from conventional block-based storage to maintaining a log of file system changes.

Prior file systems suffered from slow read speeds and even slower write speeds. The reason for this lack of write performance was that files were generally accessed in fairly small and usually non-consecutive blocks. Particularly concerning was the fact that processor performance continued to improve and that the inefficiency of file system access would impede overall system performance.

• Buffering file system writes so that a single large segment of data could be written to disk instead of an assortment of smaller writes. Fewer and larger writes would thus minimize the overhead of having to find separate locations to write data, causing improved performance over prior file systems.
• Devising a method of efficiently “cleaning” a hard disk to ensure that large segments of consecutive free space were available. This was a necessity since all writes were now large and thus required large segments of free space.
• Maintaining an ability to perform fairly quick reads (relative to prior file systems) even with a new file system structure (by maintaining an index to locate the data).

• The necessity of the file system cleaner may have adverse performance effects on some systems that are already primarily CPU-bound.

Combining multiple, distinct writes into larger groups to minimize the overhead of finding a place to write helps improve overall performance (at the cost of requiring written data to be grouped with other data that was written at about the same time). The benefits of this idea are also visible in reading data, for example when bringing in entire pages from disk to memory.

Rosenblum and Ousterhout present a log-structured file system as the solution for maximizing disk bandwidth usage. The proposed file system outperforms traditional designs significantly when writing small files.

The main contribution of the paper is the new organization of the filesystem. The two challenges are how to read a specific file without scanning the log to locate it and how to manage the free space that is left after deletions and overwrites. They solve the first problem by creating an index structure which is stored in memory and directly points to inodes in the log. For the second problem, they create a cost model and try to discover the best action plan among copying data and seeking to the next log segment. They provide simulation results for the quality of their algorithm and show real timing results for the overall efficiency of the file system.

The evaluation section could be slightly improved. Instead of relying on synthetic microbenchmarks, they could also have captured a trace of real user actions and replayed it on both filesystems; this would not be especially difficult, because the Unix FFS implementation is completely synchronous. It would also be good to see the performance of starting up the system and recovering after a failure. Finally, I wonder about the applicability of this research today: log-based filesystems have not taken over the world, despite the impressive performance results. I believe the only log-structured file system in widespread use today is JFFS2, which is primarily used on embedded devices.

The new filesystem structure makes it easier to recover from a crash and fully exploits the sequential write capabilities of modern disks. On the downside, it relies on a large cache with a suitable replacement policy to get good results.
