
Optimistic Crash Consistency

Optimistic Crash Consistency. Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. SOSP 2013

Reviews due Tuesday 4/5.

Comments

Summary: The authors of this paper propose optimistic crash consistency, an approach to crash consistency in journaling file systems. It is based on an optimistic commit protocol built on techniques such as asynchronous durability notifications, transactional checksums, delayed writes and selective data journaling to provide correct recovery from crashes together with high performance. The authors implement this approach within a Linux variant, OptFS, introducing file-system primitives, osync() and dsync(), that decouple ordering and durability of writes. The authors also evaluate OptFS using robust tests comprising micro and macro benchmarks across various workloads, and demonstrate from the collected empirical data that it behaves correctly and recovers to a consistent state.
Problem: Before write buffering was introduced, storage devices adopted simple read and write semantics. With write buffering, disk writes can complete out of order, which creates the need to order writes so that consistency can be ensured even across a crash. Out-of-order writes further complicate the recovery techniques of modern journaling and copy-on-write file systems, where recovery from a crash cannot be ensured without ordering. Pessimistic cache flushing is expensive because a] flushes render I/O scheduling less efficient, b] they unnecessarily force all previous writes to disk, c] disk reads may exhibit long latency while waiting for pending writes to complete, and d] flushing conflates ordering and durability. Probabilistic crash consistency, on the other hand, trades deterministic correctness for performance by disabling flushing; the authors note that this approach rarely leaves the file system in an inconsistent state, but it is insufficient for applications that require certainty in crash recovery. Hence they propose optimistic crash consistency, which exhibits both high performance and deterministic consistency.
Contributions: OptFS aims to combine the benefits of both worlds (pessimistic and probabilistic consistency) to provide high performance and deterministic consistency; the separation of ordering and durability is the crux of this performance benefit. The authors clearly explain the semantics of pessimistic and probabilistic crash consistency and the factors that affect each, which lets the reader understand the pros and cons of each scheme and throws light on the problem at hand: ensuring consistency without sacrificing performance. They propose that this can be obtained by doing away with the flushes interleaved between D, Jm, Jc and M. Their optimistic crash consistency therefore comprises several ideas: a] checksums remove the need for ordering writes by generalizing traditional metadata checksums to include data blocks; b] asynchronous durability notifications delay checkpointing a transaction until it has been committed durably; c] in-order journal recovery: the recovery process reads the journal to learn which transactions were made durable and discards writes that occurred out of the desired order; recovery proceeds in order, sequentially scanning the journal and performing checkpoints in order; d] in-order journal release ensures journal transactions are not freed until all corresponding checkpoint writes are confirmed durable; e] transactional checksumming is used to detect whether a write belonging to a specific transaction has occurred, logically enforcing the ordering Jm -> Jc after D; extending it to both metadata and data lets optimistic journaling ensure metadata is not checkpointed if the corresponding data was not durably written; f] background write after notification: the VM subsystem performs periodic background writes, so journal writes do not have to wait for previous checkpoints or transactions; g] reuse after notification ensures that durable metadata from earlier transactions never points to data of later ones by reusing only data blocks that are durably free (or flushing them as memory requirements dictate), which is unlikely to make the system wait or harm performance; h] selective data journaling uses block allocation information to decide whether data blocks need to be journaled like metadata; in-place data blocks are not overwritten until the transaction is checkpointed.
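To make the checksumming in a] and e] concrete, here is a minimal C sketch of my own (an illustration, not the authors' JBD2 code; the block names D, Jm, Jc follow the paper, and the structure layout is invented) of a commit block carrying checksums over the data and journaled metadata, so that recovery can detect a missing write without any flush ordering between D, Jm and Jc:

    /* Hypothetical sketch of data transactional checksumming (link with -lz). */
    #include <stdint.h>
    #include <stddef.h>
    #include <zlib.h>

    #define BLOCK_SIZE 4096

    struct commit_block {               /* the Jc block, simplified */
        uint64_t tid;                   /* transaction id */
        uint32_t data_csum;             /* checksum over all data blocks D */
        uint32_t meta_csum;             /* checksum over journaled metadata Jm */
    };

    static uint32_t csum_blocks(const uint8_t *blocks, size_t nblocks)
    {
        uint32_t c = crc32(0L, Z_NULL, 0);
        for (size_t i = 0; i < nblocks; i++)
            c = crc32(c, blocks + i * BLOCK_SIZE, BLOCK_SIZE);
        return c;
    }

    /* Filled at commit time; D, Jm and Jc may then be issued in any order. */
    void fill_commit_block(struct commit_block *jc, uint64_t tid,
                           const uint8_t *data, size_t ndata,
                           const uint8_t *meta, size_t nmeta)
    {
        jc->tid = tid;
        jc->data_csum = csum_blocks(data, ndata);
        jc->meta_csum = csum_blocks(meta, nmeta);
    }

    /* Used during recovery: a mismatch means some block of this transaction
     * never reached the disk, so it (and every later one) is discarded. */
    int transaction_valid(const struct commit_block *jc,
                          const uint8_t *data, size_t ndata,
                          const uint8_t *meta, size_t nmeta)
    {
        return jc->data_csum == csum_blocks(data, ndata) &&
               jc->meta_csum == csum_blocks(meta, nmeta);
    }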
Evaluation: The authors first take the effort to analyze the probabilistic consistency approach and lay out the factors that cause such systems to become inconsistent during a crash. For write-intensive workloads, many random writes or frequent fsync() calls lead to a higher chance (up to about 60%) of inconsistency during a crash. Other factors that affect probabilistic consistency include the size of the queue that holds in-flight writes and the distance between the journal and data locations. OptFS is the authors' implementation of optimistic crash consistency, built by extending ext4. OptFS performs better (2x to 3x) relative to ext4 for random writes and file creation when evaluated using microbenchmarks. Varmail, owing to its frequent fsync() calls, is chosen for the next evaluation, where OptFS performs 7x better than ext4. At the same time it is noticed that the performance of OptFS degrades for workloads with sequential overwrites, owing to the selective data journaling semantics. OptFS also shows an increase in CPU utilization (3% to 25%) and memory usage (486MB to 749MB). The authors justify this increase by attributing it to data checksumming and the background thread which frees data blocks whose durability timeouts have expired. The authors have also taken extra effort to evaluate it on real-world applications like gedit and SQLite, where they successfully demonstrate that OptFS provides a 100% crash recovery guarantee while providing performance equivalent to systems without cache flushing. Having carried out this extensive evaluation, I feel a few aspects are still missing, such as quantifying the increase in memory usage and CPU compute for checksumming and their impact in scenarios with a high number of transactions.
Confusion: What are Forced Unit Access (FUA) and Tagged Queuing?

1. Summary
In this paper the authors present optimistic crash consistency, a new approach to crash consistency in journaling file systems that uses several novel techniques to obtain both a high level of consistency and excellent performance in the common case. The authors demonstrate the power of optimistic crash consistency through the design, implementation and analysis of the OptFS.

2.Problem
A single file-system operation updates multiple on-disk data structures, and the system may crash in the middle of these updates, leaving it in an inconsistent state. Available crash-consistency solutions degrade performance, and thus users are typically forced to choose between high performance and strong consistency. Many users choose performance (e.g., the ext3 default configuration). Crash consistency is built upon ordered writes, which penalize performance. File systems conflate ordering and durability, which is inefficient when only ordering is required.

3.Contributions
This paper specifically addresses the question - can a file system provide both high performance and strong consistency? To find such a middle ground, the authors propose OptFS - a journaling file system that provides performance and consistency by decoupling ordering and durability. Such decoupling allows OptFS to trade freshness for performance while maintaining crash consistency.
Some highlights of novel techniques in OptFS:
* Checksums remove the need for ordering writes, since reordering can be detected using them
* Delayed (background) writes of checkpoints maintain ordering without flushes
* OptFS splits fsync() into osync(), which provides only ordering (and hence high performance), and dsync(), which provides durability, effectively decoupling ordering from durability (a usage sketch follows this list)
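A hypothetical usage sketch of the two primitives mentioned above (osync() and dsync() are OptFS-specific and do not exist on stock Linux; the stubs below merely mark where the real calls would go):

    #include <string.h>
    #include <unistd.h>

    /* Illustrative stand-ins for the OptFS system calls. */
    static int osync(int fd) { (void)fd; return 0; }   /* order writes, eventual durability */
    static int dsync(int fd) { (void)fd; return 0; }   /* order writes AND force durability */

    int append_record(int log_fd, const char *payload, const char *commit_marker,
                      int need_durability)
    {
        /* 1. Write the record payload. */
        if (write(log_fd, payload, strlen(payload)) < 0) return -1;

        /* 2. osync(): the payload is ordered before the commit marker,
         *    although neither is necessarily on disk yet. */
        if (osync(log_fd) < 0) return -1;

        /* 3. Write the commit marker; after a crash we can never see the
         *    marker without the payload, which is what most update
         *    protocols actually need. */
        if (write(log_fd, commit_marker, strlen(commit_marker)) < 0) return -1;

        /* 4. Pay for a real flush only when the caller truly needs the
         *    record to survive an immediate crash. */
        if (need_durability && dsync(log_fd) < 0) return -1;
        return 0;
    }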

4.Evaluation
Overall the authors do an excellent job of making the case for their research statement - that a file system can provide both high performance and strong consistency. They show both qualitative and quantitative analysis of the current approaches (pessimistic and probabilistic crash consistency) and build motivation for OptFS.
The authors provide evaluation of following questions that are central to their research goals:
* Preserving file-system consistency after crashes: OptFS remains consistent across 400 random crash scenarios
* Performance: OptFS 4-10x better than ext4 with flushes.
* Can meaningful application-level consistency be built on top of OptFS: The authors evaluated gedit and SQLite on OptFS
However, it would have been more insightful if the authors had provided analysis at a finer granularity along the axes of overheads and benefits, i.e., a breakdown and comparison of overheads with and without OptFS at each level, such as osync() vs. fsync(), and so on.

5.Confusion
* Can applications be completely agnostic of underlying policy of consistency (Pessimistic vs Optimistic crash consistency)?
* It is not clear how the authors handled the need to change the hardware interface (section 5.1 mentions "durability timeouts", but this is not clear to me).

1. Summary
The paper introduces the notion of optimistic crash consistency, a new approach to ensuring crash consistency in journaling file systems. They demonstrate that this protocol both correctly recovers from a file system crash and delivers much better application performance. The main idea is to decouple the ordering of writes from durability. The authors have implemented this approach in a variant of the Linux ext4 file system.

2. Problem
Most disks use write buffering, which enables disk writes to complete out of order for better write performance. Out-of-order write completion complicates the techniques for recovering from system crashes. Write ordering is mainly achieved in modern drives using the expensive cache-flush operation, forcing all previous writes to disk. This approach is pessimistic and guarantees crash recoverability/ordering at the cost of performance. On the other hand, the system may be in an inconsistent state after a crash if flushing isn't enabled; the probabilistic approach only makes weak guarantees regarding ordering/crash recoverability. The authors address the problem of providing performance along with crash consistency, at the cost of possibly recovering older data/metadata after a crash.

3. Contributions
The first contribution of the authors is the analysis of probabilistic journaling and the study of factors affecting the consistency of the probabilistic approach, such as workload, queue size and layout/distance. The authors have also analyzed the effect of using the expensive flush operation on the performance of applications.
Optimistic crash consistency is based on two main ideas. First, checksums can remove the need for ordering writes during transaction commit by generalizing traditional metadata checksums to include data blocks; recovery matches checksums and discards transactions that do not match. Second, asynchronous durability notifications are used to delay checkpointing a transaction until it has been committed durably: with an ADN the disk informs the client that a specific write request has completed and is guaranteed to be durable. OptFS comprises a variety of optimistic journaling techniques that provide the same guarantees as ordered journaling along with better application performance. In-order journal recovery reads the journal sequentially to observe which transactions were made durable and recovers up to the last durably committed transaction. In-order journal release preserves the property that transactions must be freed in order: T(i+1) is released only after T(i). Checksums are applied to data blocks and metadata so the state after a crash is known, providing consistency guarantees. Checkpointing is done in the background and does not affect application performance. An optimistic technique (reuse after notification) ensures that durable metadata from earlier transactions does not point to incorrect data blocks of later transactions. Selective data journaling selectively journals data blocks that are being overwritten in place.
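A simplified, runnable illustration of the in-order recovery rule (my own sketch, not OptFS's actual recovery code): replay proceeds sequentially and stops at the first transaction whose checksum fails, so a later transaction is never applied without all earlier ones:

    #include <stdio.h>

    struct txn {
        unsigned long tid;
        int checksum_ok;   /* 1 if D, Jm and Jc all match the commit checksum */
    };

    static void replay(const struct txn *t)   /* checkpoint the journaled blocks */
    {
        printf("replaying transaction %lu\n", t->tid);
    }

    int main(void)
    {
        /* Pretend this is what a sequential scan of the journal found. */
        struct txn journal[] = { {1, 1}, {2, 1}, {3, 0}, {4, 1} };
        size_t n = sizeof(journal) / sizeof(journal[0]);

        for (size_t i = 0; i < n; i++) {
            if (!journal[i].checksum_ok) {
                printf("transaction %lu incomplete; discarding it and all later ones\n",
                       journal[i].tid);
                break;   /* txn 4 is dropped even though its own checksum is fine */
            }
            replay(&journal[i]);
        }
        return 0;
    }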

4. Evaluation
The authors have implemented OptFS inside Linux 3.2 with additional changes to the JBD2 journaling layer. The authors have evaluated their design for both reliability and performance. They simulate 400 different crash scenarios and show that OptFS recovers correctly in each case. The authors also analyse the performance of applications using both micro- and macro-benchmarks. The performance of OptFS is similar to ext4 (without flushes, and hence without consistency guarantees) for sequential write, fileserver and webserver. OptFS performs 2-3 times better for random write, create files, Varmail and MySQL. The only case where it under-performs ext4 is the sequential overwrite case. The authors have also run experiments to determine the optimal journal size and set it to be at least 512MB. The CPU and memory overheads were computed for OptFS and are due to checksumming and checkpointing. OptFS was crash-tested using real-world applications like gedit and SQLite and was shown to recover in all cases, unlike ext4 without flushes.

5. Confusion
Why does OptFS perform so poorly in the sequential overwrite case? When is checksumming performed in the transaction?

Summary:
This paper describes a new model to ensure consistency, called "optimistic crash consistency", in journaling file systems, primarily by decoupling the ordering of writes from their durability, for which two new system calls, osync() and dsync(), have been introduced. The pessimistic and probabilistic crash consistency models are explained and used for comparing and contrasting. The primary intention is to provide consistency equivalent to pessimistic crash consistency while providing performance similar to the probabilistic crash consistency approach.

Problem:
Write buffering led to the issue of out-of-order disk writes, which can result in inconsistencies in the case of crashes. With pessimistic crash consistency, application performance suffers due to flushes, whereas probabilistic crash consistency provides no protection from crashes; thus this approach solves the problem of providing a feasible, high-performance model that still provides crash consistency.

Contribution:
Separate write ordering and durability. The paper explains the need for a new consistency model by explaining the disadvantages of the pessimistic and probabilistic crash consistency approaches, and avoids flushing buffered writes (the pessimistic approach) for ordering of writes. osync() provides ordering between writes and dsync() provides durable writes, decoupling ordering from durability. Two important concepts introduced in this paper are 1. asynchronous durability notifications to delay checkpointing and 2. checksums to avoid write ordering. The paper then describes the optimistic techniques used to provide consistency, such as "in-order journal recovery", which provides sequential recovery of transactions; "in-order journal release", which ensures transaction journals are not released until the checkpointed metadata is durable; "checksums" to avoid flushing for the orderings {D | Jm -> Jc -> M} or {D -> Jm | Jc -> M}; "reuse after notification", which ensures blocks are reused only after the freeing transaction is checkpointed; and "selective data journaling", which journals data as well as metadata for blocks that are overwritten in place. Overall, delaying the actual writes to provide high application performance is a great idea!

Evaluation:
First, the paper provides details on the probability of inconsistency in various systems in the event of a crash, to emphasize the need for a new crash consistency model. As expected, write-heavy workloads with a high fraction of random writes showed the highest probability of inconsistency. Other factors included queue size and the distance between data blocks and journal blocks. OptFS is the implementation of the optimistic crash consistency approach, and the authors verify that it recovers correctly in 400 different crash scenarios, using various file system images for evaluation. It is evaluated using micro-benchmarks (Sequential Write, Sequential Overwrite, Random Write, Create Files). For random writes and file creation, it achieves 2x-3x the performance of ext4 due to conversion of the writes into sequential journal writes. OptFS shows 7x better performance compared to ext4 in the case of Varmail, which calls fsync frequently (more flushes), thanks to delayed writes. But CPU utilization increased from 3% to 25% due to checksum compute cost, and memory usage increased from 486MB to 749MB due to the longer metadata holding period. The paper also demonstrates guaranteed crash recovery, along with performance at least as good as probabilistic crash consistency, for applications (gedit, SQLite) using osync(). But since CPU utilization is higher, how does this impact the system as a whole, especially while running several applications simultaneously? Overall, I liked the evaluation section as well, since it was very concrete. The paper could have mentioned the various ways that the asynchronous durability notification could have been simulated and why they chose the durability timeout for evaluation.

Issues:
How do systems currently handle crashes in the probabilistic approach (I understand they don't but was just thinking of what if they need to)? Is this approach used in any of the modern FSs?

1. Summary
The authors introduce optimistic crash consistency and demonstrate, with their implementation OptFS, how to build an optimistic commit protocol that correctly recovers from crashes and delivers high performance. It is a new approach to crash consistency in journaling file systems which separates durability and ordering to achieve high performance without sacrificing crash recovery.

2. Problem
Modern disks use write buffering, and because of disk schedulers the writes can complete out of order. A crash during writes to one of the on-disk structures can leave the file system in an inconsistent state because of this out-of-order scheduling.
In modern drives, expensive cache flush operations are used to flush the dirty data immediately to disk. But this flushing makes I/O scheduling inefficient, as there are few requests to choose from, and also forces all previous pending writes to disk. Flushing also conflates durability and ordering. To optimize performance, users disable flushing. The authors attempt to guarantee crash consistency with the performance of disabled flushing.

3. Contribution
The authors have demonstrated that deterministic consistency can be provided if ordering and durability are decoupled. Whereas in traditional file systems the data blocks (D), journal writes (Jm) and journal commit block (Jc) are ordered with cache flushes, in the optimistic crash consistency model they can be written out of order. Ordering is enforced only at the time of writing the in-place metadata M. For consistency, checksums are used at appropriate places and, during recovery, are used to validate the data and metadata blocks.
An asynchronous durability notification signal is used to notify the file system when the writes of a transaction are actually durable.
Checkpointing is delayed until the transaction is durable. For an incomplete transaction, all subsequent log entries are discarded; this makes sure that metadata of transaction i+1 is only observed if transaction i is present.
With delayed checkpointing, a data block is reusable only after the freeing transaction is durable, which ensures that metadata does not point to invalid data.
Two new primitives, osync() and dsync(), are introduced. osync() ensures ordering between writes but does not deal with durability; for applications requiring durability, dsync() provides both ordering and durability. Finally, the authors discuss their implementation inside Linux 3.2 as a variant of the ext4 file system.

4. Evaluation
To test reliability, the authors run two synthetic benchmarks, one that appends to and one that overwrites blocks of an existing file. OptFS recovers correctly in the roughly 400 crash scenarios tested.
They run micro benchmarks (Sequential Write, Sequential Overwrite, Random Write and Create Files) and macro benchmarks (Fileserver, Webserver, Varmail and MySQL) to measure the performance.
The performance is compared to ext4 (ordered mode with flushes) and ext4 (without flushes). Sequential write performance is similar while sequential overwrite performance is poorer because every data block is written twice. Random writes are faster as OptFS converts them into sequential journal writes.
Varmail's frequent fsync() calls cause a significant number of flushes, and thus OptFS performs better. For MySQL, OptFS performs better than ordered ext4 but worse than ext4 without flushes because of in-place writes.
They present two case studies to show how OptFS's ordering provides meaningful application-level crash consistency. OptFS is approximately 40x and 10x faster per operation than ext4 with flushes for gedit and SQLite respectively.
Overall I think the paper did a good evaluation study for the proposal.

5. Questions
Can we discuss scenarios where only osync() is needed and dsync() isn't appropriate?

1. summary
This paper is about optimistic crash consistency (OCC), which decouples ordering and durability to provide high performance and crash consistency to the file system with minimal changes to the disk interface and journaling layer. This is achieved by introducing two new file system primitives as well as the techniques of checksums and asynchronous durability notification.
2. Problem
Cache flushing is the process of immediately writing all dirty buffered data to disk. This is expensive and hinders the disk from optimizing for performance. On the other hand, disabling flushing introduces the possibility of inconsistency in case of a crash. OCC aims to bridge the gap between the performance of the probabilistic approach (latter) and the crash consistency of the pessimistic approach (former).
3. Contributions
OCC removes the need for ordering writes during a transaction commit by using transactional checksums, which generalize metadata checksums to include data blocks. A transaction is discarded during recovery upon a checksum mismatch. Checksums logically enforce two orderings: the journal commit block takes effect only after the metadata has reached the journal and only after the data blocks have reached their in-place locations. To delay checkpointing a transaction until it has been committed durably, asynchronous durability notifications (ADNs), emulated here with durability timeouts, are used. In this technique the disk informs the upper-level client when a write has been persisted.

During in-order journal recovery, if any part of a transaction was not completely made durable, then neither it nor any subsequent transaction is replayed. During in-order journal release, a transaction is freed only after the disk has notified that the corresponding checkpoint writes are durable. Through the technique of background writes after notification, it is ensured that the checkpoint of metadata occurs after the preceding writes of the data and journal. A block can be reused across transactions only if it is durably free. Selective data journaling addresses a special case of update dependency wherein a data block is overwritten in a file and the metadata for that file must be updated consistently. To cater to applications that need to force writes to stable storage for durability, two new primitives, osync() (guarantees ordering between writes) and dsync() (guarantees writes have been persisted), are provided.
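A small sketch of my own (not the paper's code) of the in-order journal release rule described above: journal space for a transaction is reclaimed only once its checkpoint writes, and those of every earlier transaction, have been confirmed durable, even though the notifications themselves may arrive out of order:

    #include <stdio.h>

    #define NTXN 5

    static int checkpoint_durable[NTXN];   /* set when the disk confirms durability */
    static int released_up_to = 0;         /* transactions [0, released_up_to) are freed */

    static void try_release(void)
    {
        while (released_up_to < NTXN && checkpoint_durable[released_up_to]) {
            printf("releasing journal space of transaction %d\n", released_up_to);
            released_up_to++;
        }
    }

    int main(void)
    {
        /* Durability notifications arrive out of order... */
        checkpoint_durable[2] = 1; try_release();   /* nothing released yet */
        checkpoint_durable[0] = 1; try_release();   /* releases transaction 0 only */
        checkpoint_durable[1] = 1; try_release();   /* now releases 1 and 2 */
        return 0;
    }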
4. Evaluation
The Optimistic File System (OptFS) is used to perform the evaluation. Two synthetic workloads are used to generate 400 different crash scenarios to evaluate reliability; it is shown that in each case OptFS recovers correctly to a consistent file system. Micro- and macro-benchmarks are used to evaluate the performance of OptFS. It outperforms ordered mode with flushes on most workloads (random write, create files, varmail), providing the same level of consistency at lower cost. On other workloads, OptFS performs as well as ordered mode without flushes (sequential write, fileserver, webserver). OptFS is not suited for workloads which consist mainly of sequential overwrites because OptFS writes every block twice. The performance evaluation is clear and complete, as it provides an explanation of the observed results in each case. OptFS consumes 10% more CPU than ext4 ordered mode without flushes due to data checksumming and the background thread that frees data blocks with expired durability timeouts. It is shown that OptFS performs well with reasonably sized journals of 512MB or greater; performance drops for smaller journal sizes.
5. Confusion
OptFS performs well on a range of common workloads and performs worse than ordered mode with flushes for much fewer workloads. Have the techniques used by OptFS been used by any previous or later file systems?


Summary:
This paper presents a new approach to crash consistency in journaling file systems that uses a range of novel techniques to obtain both a high level of consistency and excellent performance in the common case. It presents two new file system primitives, osync() and dsync(), which decouple ordering from durability. The idea is developed as a Linux ext4 variant called OptFS. OptFS improves performance for many workloads while maintaining correctness.

Problem:
Providing both performance and consistency is, in simple terms, the primary problem. To maintain order, most file systems end up flushing writes to disk very frequently, hence combining "ordering" and "durability". Here the idea is to decouple the two.

Contributions:
The main contribution of the paper is the idea that "eventual durability maintains consistency while trading freshness for increased performance", along with the addition of a much less expensive primitive, osync(), to order application writes.
Some other contributions include:
Asynchronous durability notification is the key feature. The idea is to allow the disk to perform reads and writes in the order that optimizes its scheduling and performance; the disk just has to give a notification when a specific write request has completed and is guaranteed to be durable.
The concept of data transactional checksumming to abort transactions upon mismatch enables optimistic journaling to ensure that metadata is not checkpointed if the corresponding data was not durably written.
The techniques of "reuse after notification" and "selective data journaling" are in my opinion are good contributors towards maintaining consistencies. Reuse after notification essentially ensures that durable metadata from earlier transaction never points to incorrect data blocks changes in later transactions by allocating "durable free" data blocks which in-turn become easy due to asynchronous durability notification. Selective data journaling is if update-in-place is desired vs the update in a "durable free block" as per resuse after notification.

Evaluation:

The OptFS implementation is tested for reliability and performance. The experiments were performed on an Intel i3-2100 CPU with 16 GB of memory, a Hitachi DeskStar 7k1000.B 1 TB drive, and running Linux 3.2.
The osync() primitive has been evaluated using two synthetic workloads written to stress the OptFS machinery. These workloads clearly show that OptFS recovers correctly in all crash scenarios. However, the dsync() primitive introduced earlier has not been evaluated for correctness or performance. A similar analysis is done for applications like gedit and SQLite by replacing fsync() with osync(). This analysis brings out a basic tradeoff: OptFS increases performance while still ensuring the orderings, but delays durability. The recovered snapshot after a crash in the case of OptFS can be older than with ext4 and fsync(), depending on the fsync boundary. However, numbers for "how old", i.e. how delayed the durability is, have not been provided.
The performance of OptFS is clearly several times that of ext4 even with flushes on most workloads. This has been extensively evaluated across many micro-benchmarks and macro-benchmarks. The authors also clearly evaluate and state the major drawback of OptFS: benchmarks with mainly sequential overwrites are not suited to OptFS due to selective data journaling.
Another important reliability factor not discussed in the paper: if all the journal writes happen out of order and the in-order recovery is what maintains correctness, then how is read-after-write consistency maintained (without a crash)? If two M's get updated out of order, the system, when it tries to read data, may or may not see the latest write. Only inconsistencies with respect to crash scenarios have been tested.
The overhead of data checksumming discussed previously has also not been evaluated separately. As discussed in the paper, data checksumming is more involved than metadata transactional checksumming.

Conclusion:
In comparison to the asynchronous durability notification which OptFS assumes, what are the general disk interface notifications provided to the OS to mark completion of block I/O stages?
Can we discuss FUA in more detail?

Summary
This paper describes the design and implementation of optimistic crash consistency in journaling file systems, a new approach that achieves consistency similar to pessimistic journaling while at the same time matching the high performance provided by probabilistic journaling methods for these file systems.

Problem
Up until now, there has always been a trade-off between optimizing for high disk-write throughput and maintaining data consistency in the presence of crashes. On one hand, pessimistic journaling achieves high consistency by enforcing strict write ordering through expensive cache flushes, but it suffers from poor disk-write throughput. On the other hand, probabilistic journaling optimizes for high disk-write throughput, but relaxes the ordering guarantees, even to the extent that a disk may become inconsistent after a crash in some cases. The authors propose that it is possible to achieve the best of both worlds through an optimistic journaling system.

Contributions
In my opinion, the following are the novel contributions of this paper:
(1) A detailed analysis that quantifies probabilistic consistency and highlights the key factors (workload, queue size and journal layout) that favor or harm the consistency guarantees of probabilistic journaling.
(2) Use of metadata transactional checksumming for recovery to remove the need for write ordering that enables the transactions to be committed out-of-order, unlike pessimistic journaling.
(3) Extension of the disk interface through an asynchronous durability notification that gives the disk the ability to optimize its scheduling and performance, while at the same time informing the file system asynchronously whenever the writes are made durable.
(4) Use of several precisely designed optimistic techniques like in-order journal recovery, in-order journal release, checksums etc. that achieve optimistic consistency without modifying any file system data structures.
(5) Another key idea introduced in the paper is of reuse after notification, in which the asynchronous notifications from the disk can help the file system to maintain a list of “durably-free” data blocks. This solves the issue of re-using discarded blocks across transactions without introducing security loopholes.
(6) To avoid the large copy-on-write expense of heavily reused blocks, the authors also propose selective data journaling as an alternative that can preserve the locality benefits of update-in-place file systems (see the sketch after this list).
(7) To allow applications to easily choose between durability and consistency guarantees, the authors also introduce two new file system primitives, osync() and dsync(), that provide eventual and strong guarantees respectively.
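As referenced in point (6), here is a sketch of the selective data journaling decision, assuming a hypothetical already_allocated flag on each block write (an illustration of the policy only, not OptFS's actual code path):

    #include <stdbool.h>
    #include <stdio.h>

    struct block_write {
        long blocknr;
        bool already_allocated;   /* true => this is an in-place overwrite */
    };

    static void journal_data_block(long b) { printf("block %ld: journal the data, checkpoint later\n", b); }
    static void write_in_place(long b)     { printf("block %ld: write in place, checksum kept in Jc\n", b); }

    static void submit_write(const struct block_write *w)
    {
        if (w->already_allocated)
            journal_data_block(w->blocknr);  /* selective data journaling: old contents
                                              * survive until the transaction checkpoints */
        else
            write_in_place(w->blocknr);      /* newly allocated: checksums alone suffice */
    }

    int main(void)
    {
        struct block_write fresh = { 100, false }, overwrite = { 200, true };
        submit_write(&fresh);
        submit_write(&overwrite);
        return 0;
    }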

Evaluation
The authors have evaluated their ideas through implementation of an Optimistic File System (OptFS) for Linux 3.2, as a variant of the ext4 file system, with additional changes to the JBD2 journaling layer and virtual memory subsystem. This file system implementation is then evaluated on two parameters: reliability and performance.
To evaluate the reliability guarantees provided by OptFS, two synthetic workloads are used, one appending blocks to the end of the file system and the other randomly modifying blocks of the file system. In both cases, OptFS is always able to recover a file system with a consistent prefix of updates in 400 different crash scenarios.
To evaluate the performance provided by OptFS, the authors use a number of micro- and macro-benchmarks. Using the micro-benchmarks, the authors show that OptFS has the same performance as ext4 for sequential writes while achieving a 3x performance gain for random writes. On the Createfiles benchmark, OptFS is twice as fast as ext4. For macro-benchmarking, the authors run the Fileserver, Webserver and Varmail workloads (from Filebench) to show that OptFS can achieve significant performance benefits over ext4 by avoiding expensive flushes to disk.

Overall, I believe that the authors have presented a thorough evaluation that validates all of their design goals for optimistic journaling. My concern was regarding the overheads associated with checksumming and selective data journaling, and the authors do admit that the resource consumption of OptFS may be more than ext4. Hence, optimistic journaling is not a good solution for resource- or energy-constrained systems. In fact, the performance of OptFS also degrades for small journal sizes. One thing that I think the authors have not talked much about in the evaluation section is the evaluation of the mechanisms that implement the selective data journaling policy. Since data journaling may become prohibitively expensive, there may be a couple of workloads that respond negatively to it, and therefore it would be interesting to see how to choose mechanisms to prevent this.

Confusion
What information is exactly stored at checkpoints in a typical journaling system and what it means to have in-place checkpoints on disk?

1. Summary
This paper introduces optimistic crash consistency, a new approach to achieve crash consistency in journaling file systems with better performance than traditional ones, which the authors call "pessimistic" crash consistency. The idea is implemented as a Linux file system called OptFS, based on ext4, and requires only minimal modification to the disk and file system interfaces.
2. Problem
The crash consistency problem in file systems arises from the write buffering and out-of-order write completion features of disks, which aim to optimize I/O performance. These features, however, leave the file system at risk of inconsistency upon a crash, since without flushing the file system only knows when a request is received, not when it is guaranteed to be durable on disk. Some journaling systems avoid the problem with flush commands to make sure writes are in order, but this technique can greatly degrade the performance of the system. Others simply accept the out-of-order behavior, trading off correctness for performance, which can be costly for particular workloads.
3. Contributions
The goal of optimistic crash consistency is to guarantee system consistency after a crash while maintaining the performance boost of out-of-order write completion. It is achieved by using an array of techniques implemented at the disk and journaling layers:
i. It is difficult to maintain consistency of certain journaling actions when they are out of order without knowing the status of requests after they are sent to disk. Thus the paper proposes an extension to the current disk interface which can asynchronously notify the client when a write request has actually become durable.
ii. With the new disk interface, journaling of data and metadata is divided into transactions, and checkpointing can be done once the journaled blocks of a transaction are durable. Recovery can then replay the transactions with valid checksums, and discard the others.
iii. The checkpoint write is done after the corresponding journal writes are notified to be durable, but this does not impact performance, since checkpoints of different transactions are independent and can be executed out of order in the background (as sketched below).
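A hypothetical sketch of what the asynchronous durability notification and the deferred checkpoint in iii. could look like (no commodity disk offers such a callback; the structure and function names here are invented for illustration):

    #include <stdio.h>

    struct transaction {
        int writes_submitted;   /* e.g. D, Jm and Jc */
        int writes_durable;
        int checkpointed;
    };

    static void checkpoint(struct transaction *t)
    {
        t->checkpointed = 1;
        printf("checkpointing metadata in place\n");
    }

    /* Callback the (hypothetical) disk invokes when one write becomes durable. */
    static void on_durability_notification(struct transaction *t)
    {
        if (++t->writes_durable == t->writes_submitted && !t->checkpointed)
            checkpoint(t);   /* safe: D, Jm and Jc are all on stable storage */
    }

    int main(void)
    {
        struct transaction t = { .writes_submitted = 3 };
        on_durability_notification(&t);   /* notifications may arrive in any order */
        on_durability_notification(&t);
        on_durability_notification(&t);   /* the last one triggers the checkpoint */
        return 0;
    }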
4. Evaluation
OptFS is evaluated on an Intel Core i3-2100 CPU with 16GB of memory and a 1 TB drive running Linux 3.2. Reliability is tested across 400 different crash scenarios, and OptFS successfully recovered to a consistent state after each of them. Performance is measured based on both specific read-write patterns and realistic workloads. Except for sequential overwrite, where OptFS achieves only half the bandwidth because it writes each block twice, and sequential write, where performance is identical, OptFS achieves significant performance gains in the tests compared to the ordered mode of ext4.
5. Confusion
In the performance comparison, how does OptFS achieve better performance than the out of order mode of ext4?

1. Summary
This paper introduces a new approach towards achieving crash consistency in journaling file systems that can achieve high performance at the cost of delayed durability. The paper proposes two new file system primitives, osync() and dsync(), that decouple the ordering of writes from their durability.

2. Problem
Modern disks buffer writes in memory instead of persisting them on disk, even after the completion of write requests. Disks may also perform these writes completely out of order. To achieve crash consistency, modern journaling file systems follow update protocols that force ordering among the updates of different on-disk structures via flush commands that force the dirty buffers onto disk. Hence, journaling file systems provide crash consistency at the cost of performance. We are left with two choices: no crash consistency and high performance vs. crash consistency and low performance. Practitioners preferred the former and adopted file system configurations that provide no consistency by ignoring flushes while achieving very high performance, banking on the fact that crashes are a rare event. Moreover, applications are forced to use fsync() to achieve ordering between two write() requests in their update protocols.

3. Contributions
i) The main contribution of the paper is the realization that ordering and durability are fundamentally different, and neither file systems nor applications should be forced to make writes durable on disk to achieve ordering of writes. ii) This realization resulted in two novel contributions: a. a novel optimistic crash-consistent journaling file system that achieves both high performance and crash consistency at the cost of delayed durability; b. two new primitives in the file system API, osync() and dsync(), that can be used by applications to decouple ordering and durability, thereby achieving high performance.

iii) By a combination of metadata and data checksum journaling, the forced ordering of writes between data blocks (D), journal metadata (Jm) blocks and the journal commit (Jc) sector is removed, resulting in improved performance due to the removal of one flush operation that costs tens of milliseconds in the synchronous path. iv) Performance is further improved by removing the flush of D, Jm and Jc before the in-place metadata (M) update. The write of M is delayed till D, Jm and Jc are persisted on disk. To achieve this, the paper proposes an improved disk interface that asynchronously notifies the FS after a write is persisted on disk. The delayed write of M doesn't have a negative impact on performance as it can happen in the background after acknowledging the application on the completion of the write. The removal of the flush after D, Jm and Jc might cause data loss if the system were to crash when one or more of these blocks are in memory, but the file system is guaranteed to be consistent, thus achieving high performance and crash consistency at the cost of delayed durability. v) The paper also proposes an optimistic data journaling mode that selectively journals data blocks that are repeatedly overwritten. vi) Another contribution of the paper is the study of probabilistic crash consistency, which tries to quantify the probability of inconsistency by measuring the window of vulnerability for different workloads and different patterns within workloads.
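A stub-based sketch contrasting the two commit sequences described in iii) and iv) (illustrative only; the block names D, Jm, Jc and M follow the paper, while issue() and flush() are invented stand-ins):

    #include <stdio.h>

    static void issue(const char *blk) { printf("issue %s\n", blk); }
    static void flush(void)            { printf("FLUSH (expensive cache flush)\n"); }

    static void pessimistic_commit(void)
    {
        issue("D");  flush();   /* D must be durable before Jm */
        issue("Jm"); flush();   /* Jm before Jc */
        issue("Jc"); flush();   /* Jc before checkpointing M */
        issue("M");             /* in-place metadata checkpoint */
    }

    static void optimistic_commit(void)
    {
        /* D and Jm are covered by checksums carried in Jc, so all three can be
         * issued together and reordered freely by the disk. */
        issue("D"); issue("Jm"); issue("Jc (with checksums over D and Jm)");
        /* No flush: M is written later, in the background, once asynchronous
         * durability notifications for D, Jm and Jc have arrived. */
        printf("defer M until durability notifications arrive\n");
    }

    int main(void) { pessimistic_commit(); optimistic_commit(); return 0; }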

4. Evaluation
The paper motivates the approach through an evaluation of a file benchmark that shows the order-of-magnitude performance difference between ext4 ordered mode with flushes enabled and disabled. The evaluation is thorough because, apart from showing that their approach improves file system performance, the authors also try to prove that the proposed file system is crash consistent, which is an important dimension to evaluate here. The performance experiments involve both micro and macro benchmarks and show that their system provides significantly improved performance over ext4 ordered mode with flushes, and is comparable to, or sometimes better than, ext4 ordered mode with no flushes. The evaluation also shows a slight increase in CPU and memory overhead. Since OptFS provides stronger consistency guarantees than ext4 ordered journaling mode due to selective data journaling, a performance and crash consistency comparison with ext4 data journaling mode might have been appropriate and might have shown performance improvements in the case of the sequential overwrite benchmark. The authors also evaluate the performance and the consistency of application state for two applications using the new interface with osync() and dsync() instead of fsync(). Random crashes during the update protocol never resulted in an inconsistent state, but resulted in the old state more often than the new state. In their implementation, I suppose the applications can lose 30s worth of data corresponding to the delay (D) but are always left with a consistent persistent state.

5. Confusion
The selective data journaling in OptFS journals only data blocks that are repeatedly overwritten. What exactly is the implication of repeatedly here? Does it provide the same consistency guarantees as that of ext4-data journaling mode?

1. Summary
This paper describes a new file system that uses checksums, a new disk interface, and selective data journaling to maintain consistency without waiting for multiple flushes to disk. The paper also evaluates existing file system techniques.
2. Problem
Many disks use write buffering, in which disks store data in a buffer on disk and then write those data out in some arbitrary order. To achieve a particular order, many file systems resorted to flushing the data to disk. However, this was expensive and conflated ordering and durability. Thus, many users chose to disable flushes, finding that often their data remained correct, even in the face of a system crash. However, this left them with no guarantee of consistency.
3. Contribution
The paper begins by evaluating the existing file system techniques. It finds that while some checksumming can increase the speed of file systems that use flushing, disabling flushing increases it by much more. With a file system with flushing disabled, they evaluate the probability of inconsistency for six real-world workloads. They find, first, that read-oriented workloads have very little chance of inconsistency. However, workloads with a high number of random writes have a high chance of inconsistency. The authors also find that the probability of inconsistency can have high variance. The early commit of journal before the commit of data is the most common cause of inconsistency. While the depth of the disk scheduler queue matters, even small queues can lead to a great amount of inconsistency.

The paper describes a new system that provides consistency guarantees while still being faster than using flushing to order data writes. It first proposes a new disk interface, for asynchronous durability notification. The disk would first send a notification when it has received data and then would send the durability notification when the data has been written to disk. This would allow applications to work while the data is written to disk.

The paper also describes various techniques for optimistic journaling. One is changing journal recovery to stop restoring data at the first transaction that is not complete on disk. This provides data that may not be current, but is an accurate snapshot of some prefix of updates to the data. To determine completeness, it uses checksums for both data and metadata. Another technique is delaying the checkpoint writes of metadata until receipt of the asynchronous durability notification. To provide update-in-place, the paper describes the technique of selectively journaling data when it is overwritten multiple times.

The authors also implemented this system inside Linux 3.2. Since no disk supported asynchronous durability notifications, they used durability timeouts as an approximation.
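A minimal sketch of that approximation, assuming the 30-second timeout quoted elsewhere in these reviews (the interval, the pending_write structure and the polling style are all assumptions for illustration):

    #include <stdio.h>
    #include <time.h>

    #define DURABILITY_TIMEOUT 30   /* seconds; an assumed, tunable value */

    struct pending_write {
        long   blocknr;
        time_t issued_at;
    };

    /* Polled periodically by a background thread to synthesize "notifications":
     * a write is simply assumed durable once the timeout has elapsed. */
    static int assumed_durable(const struct pending_write *w, time_t now)
    {
        return now - w->issued_at >= DURABILITY_TIMEOUT;
    }

    int main(void)
    {
        struct pending_write w = { 7, time(NULL) - 31 };   /* issued 31 s ago */
        printf("block %ld assumed durable: %s\n", w.blocknr,
               assumed_durable(&w, time(NULL)) ? "yes" : "no");
        return 0;
    }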

4. Evaluation
The authors evaluated their implementation for reliability, performance, resource consumption, and journal size. They found that the file system recovered correctly after 400 different crash scenarios. They tested the performance of the file system on both micro- and macro-benchmarks. They find that their file system is 2-3 times faster than ext4 on the micro-benchmarks. On various macro-benchmarks, the file system performs 7 times and 10 times faster. This comes at the cost, however, of substantially more CPU time, as well as additional memory usage. They found that the file system performs well with journals of 512 MB or greater.

This evaluation seems fair. The performance is improved, but none of the increases in performance are by more than a factor of 10, and it comes at some resource-usage cost. They demonstrate the usage on realistic workloads similar to those that Linux systems often run. On those workloads, the performance benefits are better than for the micro-benchmarks.

5. Confusion
Could you go into more detail about how crash simulation is performed?

1 Summary
Because modern disk controllers can arbitrarily reorder writes and delay persistence, many file systems must force a flush of disk buffers to ensure ordering, degrading performance. The authors construct a file system whose performance approaches that of a flush-free file system by taking advantage of checksums and a new asynchronous durability notification.
2 Problem
Modern hard disks provide a conventional write interface; the system issues a write request, and the drive acknowledges this request. However, disk controllers have increased in sophistication over the past several decades, and disk hardware implements sophisticated OS-like queueing and buffering strategies to maximize write throughput. This has unfortunate implications for OS-level persistence strategies. An acknowledgement from the disk controller simply means that the write has been received and buffered - nothing can be inferred about the durability of the data. Moreover, in attempting to minimize seek times when writing to the disk platter, writes can be reordered arbitrarily, and order-dependent metadata may be written in an order contradicting that specified by the write requests. By correctly placing requests for cache flushes or forced unit access, the file system can enforce ordered metadata writes. However, flushes and forced access limit the pool of available blocks for the disk scheduler and impede read accesses, degrading performance.
3 Contributions
The authors decouple ordering and durability, proposing two new file system calls, osync(), which guarantees ordering with eventual durability, and dsync() which promises immediate durability. Noting that, in the common case, applications simply wait for a synchronous notification that a write has been pushed to the disk cache, and that checkpointing of metadata occurs in the background, the authors propose a new asynchronous durability notification, which notifies the file system when a write finally becomes durable. Moreover, the authors extend the use of checksumming to data as well as metadata, as checksums remove the order dependency of journal commits and the corresponding writes. Via these two optimizations, the degree of order dependence is greatly reduced, and the disk controller is generally free to make ordering decisions over larger buffers, increasing throughput.
4 Evaluation
I liked that the authors created an "ideal" performance model, in the form of a file system which does not provide ordering or consistency guarantees - It clearly delineates the upper bound on performance. In particular, the authors show that a.) The flushing file system implementation is 4x slower than the potentially inconsistent version and b.) for workloads that are consistency-critical, such as a DBMS, the odds of inconsistency on crash are ~60%. I was disappointed in their evaluation of the optimistic strategy - when you're discussing guarantees of this sort, I'd rather see an outline of a proof of correctness. For the empirical tests, I was impressed that their 30 second approximation of asynchronous notifications eliminated inconsistency, though.

5 Confusion
I am still having trouble understanding selective data journaling, and when it is or is not used.

1. Summary
This paper presents the design and implementation of optimistic crash consistency, an approach for journaling file systems that guarantees crash consistency without incurring performance degradation that traditional file systems guaranteeing crash consistency suffer from.
2. Problem
Write buffering in modern disks complicates the problem of crash consistency. Most systems rely on the cache flush operation to implement strong consistency. Such pessimistic systems degrade performance for several reasons - a cache flush is used to order two specific writes with respect to each other, but ends up flushing a bunch of other writes to disk unnecessarily; scheduling of disk requests is constrained, which is inefficient; crashes are rare, but traditional FSs assume they are not and thus a lot of work is performed assuming a crash can occur at any inopportune moment. On the other hand, many FSs do not use cache flushes for the sake of high performance, but then can't guarantee crash consistency, which is required in many cases. In this work, the authors aim to achieve the best of both worlds - strong consistency with high performance.
3. Contributions
The biggest contribution of this work is that they decouple the concepts of ordering and durability. Crash consistency basically requires strict ordering between certain writes (for example, data and metadata) to disk and not immediate in-order durability of these writes. Traditional systems use cache flush to implement strict ordering, thus conflating ordering with durability. Systems that do not use cache flush offer probabilistic crash consistency. The authors have quantified the consistency guarantees of such systems using the metric of probability of inconsistency (Pinc) on crash. They have analyzed how these systems seemingly offer strong consistency in many cases while being at high risk of inconsistency in other cases.
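A back-of-the-envelope illustration of the Pinc metric, with made-up numbers (the paper measures the window of vulnerability empirically; this only shows the ratio being computed):

    #include <stdio.h>

    int main(void)
    {
        double total_runtime_s   = 120.0;  /* length of the workload run */
        double vulnerable_time_s = 14.5;   /* time during which e.g. Jc was durable
                                            * before D (early commit), so a crash
                                            * there would leave the FS inconsistent */

        double p_inc = vulnerable_time_s / total_runtime_s;
        printf("P_inc = %.1f%%\n", p_inc * 100.0);   /* ~12.1% for these made-up numbers */
        return 0;
    }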
The key idea is to use mechanisms that do not flush the cache but still offer ordering guarantees. They use checksums to ensure an FS update has been journaled on persistent storage in its entirety. Further, they use asynchronous durability notifications (ADN) to ensure checkpointing is done at the right times. ADN is a new interface exposed by the disk that is required to support the proposed design. The authors clearly define the guarantees offered by the file system and then go on to describe techniques that will be used to ensure those guarantees. In-order journal recovery, in-order journal release and checksums are the basic techniques used for consistent journaling. Background writes after ADN are used to update checkpoints lazily. For the special case of block reuse, they implement "reuse after notification". The design also uses selective data journaling to optimize for contiguity in cases when data blocks are repeatedly overwritten within the same file. To decouple ordering and durability, two new calls are provided - osync() ensures ordering between writes with eventual durability, while dsync() can be used for immediate durability.
4. Evaluation
The paper has done a very good job of analyzing the probabilistic consistency approach to highlight various factors that impact the likelihood of such systems actually becoming inconsistent during a crash. Write-heavy workloads with many random writes or frequent fsync() calls are more likely to cause inconsistency in case of a crash - as high as 60% probability. Also, the size of the queue holding in-flight writes and the distance between journal and data locations directly impact the probability of inconsistency. These results motivate the need for strong crash consistency guarantees.
Optimistic crash consistency has been implemented in OptFS, an extension of ext4. Using microbenchmarks, they show that OptFS performance is 2x to 3x that of ext4 for random writes and createfiles. The Varmail benchmark has frequent fsync() calls, and thus OptFS performs 7x better than ext4. On the other hand, OptFS performs poorly for workloads with sequential overwrites due to selective data journaling. They also show that OptFS increases CPU utilization from 3% to 25% and memory usage from 486MB to 749MB. Lastly, the authors use modified gedit and SQLite applications to demonstrate that OptFS gives a 100% crash recovery guarantee with performance at least as good as systems without cache flushing.
Though mostly satisfactory, there are a few questions that should have been evaluated. Does the increase in CPU usage and memory pressure impact application performance? Since checkpointing is delayed due to ADN, what is its performance impact? They used one fixed value of the durability timeout to mimic ADN; evaluating several different timeout values could show interesting results.
5. Confusion
Why do authors talk about VM sub-system while talking about background writes?
In ADN, would it suffice to just have one final asynchronous notification without any acknowledgement?
How does system decide to use selective data journaling in certain cases?
Authors talk about applications using osync() and dsync(). Can you give some examples of applications where each of those would need to be used?

Summary
This paper talks about optimistic crash consistency, which provides a novel mechanism to guarantee consistent file system state without using expensive disk-cache flushes.
Problem
Modern disk drives use caching and out-of-order disk scheduling to obtain better I/O bandwidth from the device. However, this interferes with journaling file systems that rely on the ordering of a sequence of writes to data, journal and metadata commits. Such file systems rely on flushing the disk cache to enforce ordering between two write-sets. Excessive cache flushing reduces the benefits of I/O scheduling for the native disk scheduler. Disk reads may also suffer higher latency while waiting for a large cache flush. As a result, some systems disable disk cache flushes at the expense of occasional inconsistency after a crash.
Contribution
The authors' first observation is that flushing the cache for each ordering is pessimistic, in that it writes out all previous dirty blocks even though the required ordering may have occurred naturally. This is called pessimistic journaling. In contrast, turning cache flushing off leads to probabilistic journaling. For some workloads (Varmail) probabilistic consistency can gain a 5x speed-up over pessimistic journaling.
1- The authors simulate and measure the probability of inconsistency for multiple workloads and present some interesting results. Namely, applications with random writes and a large number of fsync’s have a higher chance of being inconsistent. Most of this inconsistency is due to re-ordering of the data and the journal commit aka early commit and it increases with deeper request queues.
2- A key idea in the design is to separate ordering from durability. Optimistic crash consistency relies on an extended checksum to relax the ordering between data, journal writes and the journal commit. Secondly, it uses an extension to the disk interface, viz. an asynchronous notification of when a write is persisted on the disk (and not just in its cache). Once the durability notifications for the data, journal and journal commit have been observed and previous transactions have committed, it is safe to perform the in-place checkpoint.
3- For achieving consistency the optimistic FS must use in-order journal recovery and a transactional checksum that covers all the data blocks. The FS must wait for durability notifications before writing in-place updates but this can happen in the background without stalling the application. These are emulated using durability timeouts.
4- With optimistic consistency, if a data block is re-used before the transaction that freed it commits, it is possible for the data to be clobbered by a crash without the freeing transaction being committed. The solution is to delay adding a data block to the free-list until the block is so-called “durably free”. Similarly, block overwrites are problematic in journaling file systems. It is possible to re-map the data block, but that could reduce contiguity in the file. The authors use selective data journaling, which journals the data block(s) and writes the data during commit.
5- Lastly, a major contribution is in separating the interface for ordering and durability at the application level; this comes in the form of osync and dsync. There are sufficient use cases where applications can gain in performance by ensuring only ordering between multiple write sets to reach a consistent state.
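To make item 2 concrete, here is a minimal C sketch of the kind of extended transactional checksum the review describes: a single checksum computed over the data blocks D and the journaled metadata Jm, stored in the commit block Jc, so that all three can be issued without intervening flushes. The structure names, the layout and the use of zlib's crc32 are illustrative assumptions, not the actual OptFS on-disk format.

#include <stdint.h>
#include <zlib.h>   /* crc32() */

#define BLOCK_SIZE 4096   /* assumed block size */

/* Illustrative in-memory view of a transaction: data blocks D and
 * journaled metadata blocks Jm. */
struct txn {
    uint8_t (*data)[BLOCK_SIZE];    /* D  */
    int      n_data;
    uint8_t (*jmeta)[BLOCK_SIZE];   /* Jm */
    int      n_jmeta;
};

/* The commit block Jc carries the checksum covering both D and Jm. */
struct commit_block {
    uint64_t txn_id;
    uint32_t checksum;
};

/* Compute the checksum that lets recovery later decide whether D and Jm
 * actually reached the disk, even though D, Jm and Jc were all issued
 * without flushes and may have been reordered by the drive. */
uint32_t txn_checksum(const struct txn *t)
{
    uLong c = crc32(0L, Z_NULL, 0);
    for (int i = 0; i < t->n_data; i++)
        c = crc32(c, t->data[i], BLOCK_SIZE);
    for (int i = 0; i < t->n_jmeta; i++)
        c = crc32(c, t->jmeta[i], BLOCK_SIZE);
    return (uint32_t)c;
}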
Evaluation
The authors evaluate crash consistency using a set of file-system images with re-orderings, and OptFS passes all the tests. Micro-benchmark based measurements show that OptFS performs comparably to ext4 without flushing and clearly outperforms it for random writes, since selective data journaling turns them into sequential journal writes. OptFS’s delayed commits also show benefits in applications with a large number of fsyncs, like Varmail. The authors also measure the CPU overhead of using OptFS and the effect of journal size. Lastly they provide two interesting use cases for optimistic consistency in applications (gedit and SQLite) using the osync call. They show that OptFS does not lead to inconsistency, but more frequently restores to the old state due to the delayed commit operation. An order of magnitude speed-up is obtained using osync.
Confusion
How frequent can crashes be?
Why does keeping the journal farther away from the data lead to lower chances of inconsistency?
It wasn’t clear why meta data checkpoints Mi and Mi+1 can be written out of order?

1. Summary
This paper introduces optimistic crash consistency (OCC) and discusses an implementation of this idea, OptFS. Optimistic crash consistency is a new approach to crash consistency in journaling file systems and separates the notions of durability and ordering to achieve high performance without sacrificing crash recovery.

2. Problem
Write buffering leads to disk writes being completed out of order. A single file system operation could update multiple on-disk data structures. A crash during writes to one of these structures could lead to an inconsistent state, especially because of out-of-order scheduling of these writes. To handle this scenario, ordering is achieved via expensive cache flushing operations, which cause dirty data to be persisted immediately. However, flushing makes I/O scheduling less efficient and could block the disk for a long period of time. Flushing also conflates the notion of durability with ordering and does not provide an opportunity for a client to enjoy the benefits of one while avoiding the costs of the other. Given that crashes are rare, this cost for every flush is prohibitively high. To avoid the performance penalty of flushing, users disable flushing. While certain applications still tend to remain consistent, this can lead to inconsistencies for various other classes of applications. The authors try to achieve the performance of a system with flushing disabled while still guaranteeing deterministic consistency. They do so by decoupling ordering from durability.

3. Contribution
The authors quantify probabilistic crash consistency and identify factors affecting it. The authors identify that deterministic consistency could be provided without sacrificing performance if ordering is decoupled from durability. Traditional file systems achieve consistency by issuing a cache flush between writes to data blocks (D), journal metadata writes (Jm) and the journal commit block (Jc). Finally, after the commit, the FS is free to update the metadata blocks in place (M). OCC allows D, Jm and Jc to be written out of order. The only time it forces an ordering is while writing M. To ensure that the system does not become inconsistent when D, Jm and Jc are being written out of order, OCC uses checksums at appropriate places. The checksum is used during crash recovery to validate data/metadata blocks; transactions are discarded if a mismatch is found. Additional care is taken to maintain consistency when a data block could be reused across transactions. For the FS to be notified about the durability of a transaction (D, Jm, Jc), an asynchronous durability notification signal is used. This signal ensures that checkpointing is delayed until the transaction is durable; it requires a modification of the disk interface. During crash recovery, if the FS encounters an incomplete transaction, all subsequent log entries are discarded. This ensures that metadata written in transaction i+1 is only observed if the metadata from transaction i is also visible. Delaying checkpointing until transaction durability and reusing data blocks only after the freeing transaction is durable together ensure that the metadata does not point to invalid data. OCC provides selective data journaling for data blocks that are continuously overwritten. Thus it can perform in-place updates for suitable workloads without incurring the performance overhead of writing the data block twice in the common case. This also helps maintain the disk layout of the file. OCC introduces two new FS primitives, osync() and dsync(). osync() ensures ordering between writes but does not deal with durability. For applications requiring durability, dsync() provides both ordering and durability. Finally the authors discuss their implementation of OCC, the Optimistic File System (OptFS), inside Linux 3.2 as a variant of the ext4 FS.
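As a rough illustration of the in-order recovery described above, the sketch below walks the journal sequentially and stops at the first transaction whose commit block is missing or whose recomputed checksum mismatches, discarding it and everything after it. All names are invented; this is not the actual OptFS recovery code.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory view of a scanned journal. */
struct journal_txn {
    uint64_t id;
    int commit_present;   /* commit block Jc was found on disk            */
    int checksum_ok;      /* recomputed checksum over D and Jm matches Jc */
};

/* In-order recovery: checkpoint transactions sequentially; the first
 * incomplete transaction and everything after it are dropped, so metadata
 * from transaction i+1 can never become visible without transaction i. */
int recover(const struct journal_txn *txns, int n)
{
    int recovered = 0;
    for (int i = 0; i < n; i++) {
        if (!txns[i].commit_present || !txns[i].checksum_ok)
            break;                               /* discard Ti .. Tn-1 */
        printf("checkpointing txn %llu\n",
               (unsigned long long)txns[i].id);  /* stands in for real checkpointing */
        recovered++;
    }
    return recovered;
}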

4. Evaluation
The authors provide a comprehensive evaluation of their system. To test reliability, they run two synthetic benchmarks, one which appends to and the other which overwrites blocks of an existing file. For 400 different crash scenarios, OptFS recovers correctly. To measure performance, they run micro benchmarks (Sequential Write, Sequential Overwrite, Random Write and Create Files) and macro benchmarks (Fileserver, Webserver, Varmail and MySQL). The performance is compared to ext4 (ordered mode with flushes) and ext4 (without flushes). Sequential write performance is similar, while sequential overwrite performance is poorer because every data block is written twice. Random writes are faster as OptFS converts them into sequential journal writes. Varmail’s frequent fsync calls cause a significant number of flushes, and thus OptFS performs better. For MySQL, OptFS performs better than ordered ext4 but worse than ext4 without flushes because of in-place writes. Some of OptFS’s performance benefit can be attributed to the fact that OptFS delays writing dirty blocks and issues them in large batches. However, OptFS consumes more CPU time and memory because of checksum calculation and holding metadata for a longer period of time. The authors also evaluate a scenario with different journal sizes. Finally they introduce two case studies to show how OptFS ordering provides meaningful application-level crash consistency. OptFS is approximately 40x and 10x faster per operation than ext4 with flushes for gedit and SQLite respectively. However, OptFS recovers to an old state after a crash more often than the others because of delayed writing of dirty blocks. The paper also lacked a clear explanation as to why the durability timeout interval was chosen to be 30 seconds.

5. Questions
1. Currently, OptFS relies on two notifications: first when the disk has received the write, and later when the write has been persisted. Can the system work if only the latter is provided?
2. Could we discuss Figure 3, why are certain applications more prone to inconsistencies than others?

Summary
This paper describes a new approach to crash consistency for journaling file systems - optimistic crash consistency. This approach aims at providing the high performance of probabilistic crash consistency together with the deterministic consistency of the pessimistic approach. The paper further gives an implementation of the new approach within a Linux ext4 journaling system, along with an evaluation of the same.

Problem
A smart disk scheduler can re-order the write requests sent to it by buffering the writes in order to achieve high performance. Such re-ordering of writes in disks creates problems for recovery techniques after system crashes; the system could recover into an inconsistent state. To avoid such inconsistencies, ordered journaling is used, where, to maintain an order among the writes, the file system immediately flushes the cache containing dirty data. Though this guarantees consistency, it drastically reduces performance, as the disk cannot schedule writes optimally. An alternative is the probabilistic consistency approach, which disables flushes and leaves consistency to chance. Even though this increases performance, it provides only a weak consistency guarantee. Hence, the authors propose a new approach to get the best of both worlds.

Contributions
a. The paper presents a new approach to crash consistency in journaling file systems by decoupling the ordering of writes from their durability. Two new primitives - osync() and dsync() - were introduced for this purpose.
b. The paper details a study of pessimistic and probabilistic crash consistency and points out the factors that led to the development of optimistic crash consistency to bring the best of both worlds.
c. A new disk interface is presented, which adds an asynchronous durability notification from the disk to the file system to convey that a write has completed and is persistent.
d. OptFS comprises a variety of optimistic techniques. In-order journal recovery ensures that recovery from the journal is done sequentially. A journal transaction is only released when the notification for the corresponding checkpoint is received from the disk. Checkpointing is done in the background and hence does not impact application performance. Checksums are used for both data blocks and the journal to ensure consistency. A data block is only considered free after a notification from the disk confirms that the given block is durably free. Selective data journaling enables data journaling only when a file is being overwritten in place; the selective property reduces the number of double writes significantly, thus preserving performance (a sketch of this decision follows).
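A minimal sketch of the selective data journaling decision as these reviews describe it: data goes through the journal only when an already-allocated block of the file is being overwritten in place, while newly allocated blocks are written straight to their final location and are covered only by the transactional checksum. The function and enum names are assumptions for illustration, not OptFS internals.

#include <stdbool.h>

enum write_path {
    ORDERED_WITH_CHECKSUM,   /* new block: write in place, checksum covers it   */
    FULL_DATA_JOURNAL        /* overwrite: journal the data block like metadata */
};

/* Overwriting an allocated block could expose torn data if a crash lands
 * between the metadata becoming durable and the new data becoming durable,
 * so such blocks are journaled; freshly allocated blocks cannot be reached
 * by any durable metadata until the transaction commits, so they skip the
 * journal. */
enum write_path choose_path(bool block_already_allocated, bool is_overwrite)
{
    return (block_already_allocated && is_overwrite)
               ? FULL_DATA_JOURNAL
               : ORDERED_WITH_CHECKSUM;
}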

Evaluation
A bunch of micro and macro benchmarks are run over OptFS, implemented on a Linux ext4 system. The reliability of crash recovery was found to be 100% over 400 crash scenarios. With the micro and macro benchmarking, OptFS outperformed ext4 ordered journaling with flushes and provided the same level of consistency at a lower cost. The performance compared to ordered mode without flushes was found to be similar, though the sequential overwrite benchmark showed a decrease in performance for OptFS. The paper also reports the effect of different journal sizes on performance for OptFS. Two real-world applications - gedit and SQLite - were modified to use osync() and were studied for their performance. OptFS always showed consistent states - recovering either to the old file or to the new file - with a boost in performance. Overall, the evaluation of OptFS seemed good, and the comparisons with traditional recovery systems provided useful insight into how well OptFS performs.

Confusion
Not very clear on where/how the checksums for data blocks are stored.

1.Summary:
This paper is about a new approach to journaling in file systems, which the authors call optimistic crash consistency. This guarantees consistency without trading off performance by decoupling the order of writes from their durability. This design is implemented in a Linux ext4 variant called OptFS and evaluated for its reliability and performance.

2.Problem:
1) Write ordering (data [D] -> journal metadata [Jm] -> journal commit [Jc] -> metadata [M]) is important to ensure consistency after a crash. This is achieved by cache flush operations in drives, which persist the data to the media. This operation is expensive since it makes disk reads wait for pending writes to complete, even when the application does not require it. This is the pessimistic crash consistency approach and it hurts performance.
2) To work around this, cache flushes are disabled, and it has been observed that the system is rarely left inconsistent. This improves performance but leaves consistency to chance, which is not acceptable for applications where consistency is strictly required, such as a DBMS.
To solve this problem, the authors introduce optimistic crash consistency, which matches the performance of the probabilistic approach but addresses the above problems by decoupling ordering from persistence.

3.Contributions:
Two key contributions of the paper are the quantification of probabilistic crash consistency and the design of the optimistic crash consistency technique.
I) Probabilistic crash consistency:
The system is tested by disabling flushing in ext4 and tracing for inconsistencies. The authors quantify probabilistic crash consistency by introducing the 'window of vulnerability', the time period during which reordered writes can leave the file system inconsistent. The probability of inconsistency (P_inc) is the total time spent in windows of vulnerability divided by the total workload time (expressed as a formula below). Factors affecting P_inc, such as workload, disk scheduler queue size and journal layout, are evaluated, showing that early journal commit causes most of the inconsistencies.
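Written out as a formula (my notation, following the definition above), the probability of inconsistency is the fraction of the run spent inside a window of vulnerability:

\[
P_{inc} \;=\; \frac{\sum_{i} t_{wov}^{(i)}}{T_{total}}
\]

where t_wov^(i) is the length of the i-th window of vulnerability and T_total is the total runtime of the workload.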
II) Optimistic crash consistency:
1) To combine the advantages of pessimistic and probabilistic crash consistency, flushing is disabled for better performance, and the effect of ordering is obtained by transactional checksumming over D and Jm, with the checksum stored in Jc. On journal replay, transactions are discarded on a checksum mismatch.
2) Checkpointing is delayed using asynchronous durability notifications, whereby the disk notifies the file system once a transaction's writes have been made durable.
3) Transaction order is preserved after a crash by reading the journal in order during recovery, and journal space is only released after the corresponding checkpoint writes are persisted.
4) Selective data journaling logs overwritten data blocks to the journal as well; this preserves the locality benefits of an update-in-place file system.
5) File system primitives (to order writes to disk): osync() guarantees ordering between writes and eventual durability; dsync() additionally persists pending writes immediately. A usage sketch follows.
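As a usage sketch for the primitives in item 5, the C fragment below performs a gedit-style atomic document update: write a temporary file, osync() it so its data is ordered before the rename, then rename over the original. The osync()/dsync() prototypes are placeholders, since their exact C signatures are not given here.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder prototypes for the OptFS primitives. */
int osync(int fd);   /* ordering + eventual durability  */
int dsync(int fd);   /* ordering + immediate durability */

/* Atomic update: the rename can never expose a partially written file,
 * because osync() orders the temporary file's data before the rename;
 * durability of the new contents is only eventual. */
int save_document(const char *path, const char *tmp_path,
                  const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (osync(fd) < 0) { close(fd); return -1; }
    close(fd);
    return rename(tmp_path, path);
}

/* An application that must know the data is on stable storage before
 * acknowledging it to a user would call dsync() instead of osync(). */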

4.Evaluations:
OptFS has been implemented in Linux 3.2 as a variant of the ext4 file system. ADNs are emulated with durability timeouts (an interval within which outstanding writes are assumed to become durable), as sketched below.
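A minimal sketch of the durability-timeout emulation mentioned here: with no real ADN support in the drive, a write issued at time t is simply treated as durable once a fixed interval has elapsed (the reviews mention a 30-second value). The names are invented.

#include <time.h>

#define DURABILITY_TIMEOUT_SEC 30   /* fixed timeout value used in the evaluation */

struct pending_write {
    time_t issued_at;   /* when the write was handed to the disk */
};

/* Deferred work (checkpointing, journal release) is only allowed to proceed
 * once the write is assumed durable, i.e. the timeout has elapsed. */
int assumed_durable(const struct pending_write *w, time_t now)
{
    return (now - w->issued_at) >= DURABILITY_TIMEOUT_SEC;
}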
Three questions authors answer in their evaluation :
1) File system consistency after crashes ?
Crashes are simulated at various points for append and overwrite workloads that use osync(), and the file system state is consistent in 100% of the cases with OptFS, demonstrating good reliability.
2) Performance of OptFS ?
Here, they evaluate using micro-benchmarks and macro-benchmarks. Micro-benchmarks: for random writes, OptFS performs 3x faster than ext4 ordered mode with flushes due to selective data journaling. Macro-benchmarks: Filebench fileserver, webserver and varmail workloads. Varmail issues many fsync calls, leading OptFS to perform 7x better than ext4. Overall, OptFS performs better than ext4 in ordered mode with flushes, and about as well as ext4 without flushes while providing stronger consistency.
3) Can application level consistency be built on top of OptFS ?
Evaluated on gedit for atomic updates by replacing calls to fsync with osync. Atomic updates are achieved, but ext4 performed better since OptFS delays writes (trading freshness for performance). For SQLite, OptFS always results in a consistent state, unlike ext4 without flushes, and performs 10x better than ext4 with flushes.
Overall, a good evaluation of all the features of the system, with a quantitative comparison to the pessimistic and probabilistic crash consistency techniques and good reasoning.

5.Confusion:
Exact difference between FUA and fsync.

1. Summary
In this paper, the authors introduce optimistic crash consistency for providing both consistency and performance in journaling file systems. The system developed, called OptFS, provides new file-system primitives, osync() and dsync(), for decoupling the ordering of writes from their durability, thereby trading freshness for performance while maintaining crash consistency.
2. Problem
Existing systems are not able to provide both performance and consistency because they employ cache flush operations that are expensive and prohibitive. These systems conflate ordering and durability. Many systems choose not to guarantee crash consistency because of this performance tradeoff.
3. Contribution
The main contribution in this paper is the OptFS file system, which provides strong consistency along with high performance. The authors follow the traditional research approach in systems where the existing systems are thoroughly studied and performance bottlenecks are identified; they then provide solutions to address these bottlenecks. Present journaling systems require ordered journaling or fall back to probabilistic crash consistency. OptFS introduces the primitives osync() and dsync(): osync() ensures ordering between writes but only eventual durability, and dsync() ensures immediate durability as well as ordering. A slight change to the disk interface is made to provide asynchronous durability notifications. An ADN is a notification for when a write is persisted, in addition to the existing notification for when the write has simply been received by the disk. OptFS eliminates flushes and allows blocks to be re-ordered. Optimistic crash consistency handles this by either detecting re-ordering after a crash or preventing the re-ordering from occurring. ADNs increase disk freedom, as blocks may be destaged in any order and at any time. OptFS uses ADNs to control which blocks are dirty in the disk cache at the same time. Re-ordering is detected using checksums: the checksums are computed over data and metadata and checked during recovery, a mismatch indicating blocks lost during the crash. They also employ delayed writes to prevent reordering (a sketch of the delayed-checkpoint condition follows). Other techniques like in-order journal recovery and release, reuse after notification and selective data journaling are also employed.
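To make the delayed-write idea concrete, here is a small sketch (with invented field names) of the condition under which the in-place metadata M for a transaction may be issued: only after ADNs have arrived for its D, Jm and Jc, and after all earlier transactions have been checkpointed, so M can never reach the platter ahead of what it depends on even though no flush is used.

/* Illustrative per-transaction bookkeeping of received ADNs. */
struct txn_state {
    int d_durable;          /* ADN received for the data blocks D            */
    int jm_durable;         /* ADN received for the journaled metadata Jm    */
    int jc_durable;         /* ADN received for the commit block Jc          */
    int prev_checkpointed;  /* all earlier transactions already checkpointed */
};

/* Background write after notification: gate the in-place metadata write. */
int may_checkpoint_metadata(const struct txn_state *t)
{
    return t->d_durable && t->jm_durable && t->jc_durable
        && t->prev_checkpointed;
}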
4. Evaluation
The authors highlight the simplicity of implementing the system, as only around 3000 lines of code were modified or added. The modifications were to the journaling layer and the virtual memory subsystem. ADNs were emulated using timeouts. They observe that OptFS remains consistent after 400 random crashes and performs 4-10x better than ext4 with flushes. They study the performance of gedit and SQLite on OptFS to evaluate whether meaningful application-level consistency can be built on top of OptFS. The applications were run, the writes traced, re-ordered, dropped or replayed, and the application state on the new image was examined. With SQLite, there were zero inconsistencies. osync() changes the semantics from ACID to ACI-(Eventual Durability). I feel that the authors have evaluated the system keeping in mind their initial goals and have also evaluated real-world efficiency by checking performance with applications like gedit and SQLite, which have varied and intensive I/O demands. They successfully show that eventual durability maintains consistency while trading freshness for increased performance.
5. Confusion
More details on FUA would be good. What is the current industry value of the proposed system?

Summary
The paper describes optimistic crash consistency, a journaling technique which achieves separation between the ordering and durability requirements for disk writes. This separation allows an optimistic journaling file system to obtain the performance benefits of probabilistic crash consistency while also ensuring the consistency semantics of pessimistic crash consistency.

Problem
To handle the crash consistency of journal transactions in systems supporting write buffering, pessimistic journaling approaches relied on the use of expensive flush operations or Forced Unit Access (FUA) operations. Flushing uses durability as a means to achieve write ordering and therefore mixes durability with ordering, although an application may only require one of these. This flushing overhead leads to lost opportunities for speeding up disk writes (by re-ordering), forced writes for blocks which did not have immediate durability requirements, and high latency for read operations when they get stalled by a flush operation. This overhead was endured despite crashes being a rare phenomenon. To remove the overhead of flushing, some journaling file systems disabled it altogether (probabilistic crash consistency), sacrificing crash consistency for performance. Earlier approaches at tackling this problem, such as Soft Updates, Featherstitch, NoFS, backpointer-based consistency, etc., were not able to provide support for transactions, high performance, consistency, and immediate durability-on-demand while being flush-free in nature.

Contribution
Optimistic crash consistency achieves high performance by allowing the disk to persist cached writes in any order of its choice, while consistency is maintained by avoiding situations that may lead to a file system inconsistency after a crash, or by persisting sufficient state to successfully recover from an inconsistency following a crash. Data transactional checksumming ensures that the writes that commit a journal transaction (in-place / journaled data, journaled metadata and journal commit blocks) can occur in any order while preserving consistency. Optimistic journaling also requires the disk to support an asynchronous durability notification, which notifies the file system of the persistence of an asynchronous disk write. The asynchronous durability notification of a completed transaction's journal commit allows checkpointing for that and subsequent transactions (background write after notification). On a crash, optimistic journaling performs an in-order recovery by sequentially scanning the journal, performing checkpoints in order and stopping at the first transaction with a mismatching checksum. On completion of a transaction (its checkpoint writes are persisted), journal entries are freed in order from the journal area of the disk. The reuse-after-notification technique ensures that only 'durably free' blocks are used when a new block is to be allocated to a file (a sketch follows). The selective data journaling technique journals both data and metadata for overwritten data blocks (JD|JM|JC → M,D, where JD, JM and JC are checksummed), while only journaling metadata for newly allocated blocks (D|JM|JC → M, where D, JM and JC are checksummed). This ensures data consistency for overwritten blocks and performance for newly created blocks, which do not incur the overhead of the extra data copy incurred in data journaling. The optimistic journaling system also exposes the osync() and dsync() system calls for applications to manage their ordering and durability requirements. osync() offers ordering but only eventual durability, while dsync() offers immediate synchronous durability.
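A small sketch of the reuse-after-notification bookkeeping described above, using an invented two-list structure: a freed block first sits on a pending list and only becomes allocatable once the transaction that freed it is known durable, so durable metadata from an older transaction can never end up pointing at reused data.

#include <stddef.h>
#include <stdint.h>

struct free_lists {
    uint64_t *pending;        /* freed, but the freeing txn is not yet durable */
    size_t    n_pending;
    uint64_t *durably_free;   /* safe to hand out to new allocations           */
    size_t    n_durably_free;
};

/* Called when the durability notification for the freeing transaction's
 * checkpoint arrives: the block becomes eligible for reuse. */
void promote_to_durably_free(struct free_lists *fl, size_t idx)
{
    fl->durably_free[fl->n_durably_free++] = fl->pending[idx];
    fl->pending[idx] = fl->pending[--fl->n_pending];   /* swap-remove */
}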

Evaluation
An initial analysis between the regular ext4 system (that used flushes) and its counterpart that did not use flushing was performed to highlight the performance superiority of an approach with cache flushing disabled. The authors evaluate the probability of an inconsistency under probabilistic crash consistency by studying how it varies with the workload, queue size and journal layout. This probability was found to be significant for write-intensive workloads that had random writes and made frequent use of fsync() calls. Also, early commit before data (JC → D) was found to be the chief cause of inconsistency-inducing re-ordering. While increasing the queue size was found to increase performance, it also led to higher chances of inconsistency. Finally, it was found that increasing the distance between the journal and checkpoint regions led to a lower chance of inconsistency, but also degraded performance to a small extent.

OptFS, a prototype of optimistic journaling, was implemented by making changes to the ext4 file system. As no contemporary disk supported asynchronous durability notifications, OptFS instead used durability timeouts to simulate them. OptFS was first tested for reliability using a crash-robustness framework that consisted of append and overwrite workloads, and it was found to recover successfully from all kinds of simulated crashes that left the file system in an inconsistent state. The performance of OptFS, ext4 with flushes and ext4 without flushes was compared on micro-benchmarks and macro-benchmarks, which revealed that OptFS performed similarly to the other file systems for workloads with sequential writes, while it outperformed them for workloads with random writes. However, OptFS performed poorly for workloads consisting of sequential overwrites due to the overhead of data journaling. An analysis of resource consumption on the Varmail workload showed that OptFS has higher CPU and memory consumption than its counterparts. This overhead was attributed to the expensive CRC32 data checksumming and the policy of OptFS to hold metadata in memory longer. Another analysis on the MySQL OLTP benchmark shows that OptFS degrades in performance only for smaller journal sizes.

The evaluation also consisted of building applications on top of the three files systems (OptFS, ext4 with flush and ext4 without flush) and verifying their performance and crash consistency behavior. As expected, OptFS gave high performance and restored file system consistency after a crash. The competing ext4 variants could only achieve one of these properties.

Overall, I believe that OptFS was thoroughly tested in all relevant areas to successfully demonstrate that it had achieved its design objectives of high performance and reliable crash consistency at the cost of reasonable memory and CPU overhead.

Question / confusion
1. How does ext4 allow D|JM|JC ->M (with JM and JC checksummed) with the “right” set of mount options?
2. Is optimistic journaling currently being used in any file systems? If no, what challenges are faced in its adoption?

Summary
The paper presents a new crash consistency mechanism for journaling file systems that improves performance for many workloads using techniques like transactional checksumming and asynchronous durability notifications. It introduces new APIs, osync and dsync, that allow the disk to be used according to application needs.
Problem
With the write buffering provided by the disk, file system performance has increased, but recovery becomes difficult in case of crashes. So applications end up using a pessimistic approach of flushing the data, which is an expensive operation. The paper suggests a new optimistic crash consistency mechanism which achieves the performance of the probabilistic approach while still guaranteeing crash recovery.
Contributions
The main contribution of the paper is optimistic crash consistency: optimistic journaling allows the disk to perform writes in any order, and in case of a crash, consistency is still maintained across transactions. This is achieved using checksums and asynchronous durability notifications.
1] Checksums remove the constraint that the disk must order the writes. With metadata transactional checksumming, a checksum over the journaled metadata is calculated and stored in the journal commit block. If a crash occurs before the journal metadata is completely written, the checksum does not match, and this transaction and any further transactions are discarded.
2] Asynchronous durability notifications. The disk provides two notifications: one when it receives the request and one when the specific write has been persisted (a sketch of this interface follows this list). Decoupling request acknowledgement from persistence enables higher levels of I/O concurrency.
3] Selective data journaling ensures that if data is overwritten, it is logged into the journal, which helps in recovery. Data journaling thus only occurs when required, preserving disk performance.
4] The two new APIs, dsync and osync, allow the disk to be used according to application needs, i.e., immediate durability when required, or ordering with eventual durability otherwise.
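As referenced in item 2], here is a sketch of what the extended disk interface might look like from the file system's side: each write carries two callbacks, one fired when the drive has accepted the request and one fired later when the data is actually persistent. The names and shape of this interface are assumptions; the real proposal is a change to the storage protocol, not a C API.

/* Illustrative completion interface for a single write request. */
struct disk_write {
    void (*on_received)(void *ctx);  /* today's completion: request is in the
                                        drive's cache                          */
    void (*on_durable)(void *ctx);   /* proposed asynchronous durability
                                        notification (ADN): data is on media   */
    void *ctx;
};

/* The file system can release the issuing thread when on_received() fires,
 * but defers checkpointing and journal release until on_durable() fires. */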
Evaluation
The reliability evaluation is good: it exercises the consistency mechanism by performing a large number of crashes. What I found missing is an indication of how much data could be recovered; details of the amount of data on the disk before the crash and the amount recovered after the crash should have been provided in the table. The performance evaluation is very good: they compare against the ext4 file system with and without flushes, with a clear explanation of the cases in which OptFS performs better and those in which it does not. They also provide an overall summary stating for which workloads OptFS is better. A clear evaluation of memory and CPU utilization is shown, and performance under memory pressure is also evaluated. A clear case study with gedit and SQLite evaluates the osync API and shows how OptFS is able to achieve the performance of ext4 without flushes while being faster than ext4 with flushes. An important missing evaluation is recovery time after a crash and how it compares to other systems, since checksums now have to be recomputed during recovery, which takes more time.
Confusion
This approach seems to perform well on specific workloads. Is this technique used anywhere in industry?

1. Summary
This paper introduces a new type of crash consistency called optimistic crash consistency which both correctly recovers from crashes and delivers high performance. Their system is a modified design of ext4 called OptFS. The paper talks about how pessimistic & probabilistic crash consistency work and the problems involved in them. They then talk about their idea of optimistic crash consistency and its properties, techniques and implementation detail. Finally they compare their OptFS design with the design of ext4 with & without flush on different applications and operations

2. Problem
The main problem is the tradeoff between performance and consistency. The introduction of write buffers has allowed out-of-order writes, and this can lead to inconsistencies when recovering from crashes. Write ordering is usually provided by forcing a cache flush operation, but this is very expensive and leads to performance degradation; some applications avoid this and sacrifice correctness for performance. Enforcing ordering via flushes is the pessimistic approach (pessimistic crash consistency). Although probabilistic crash consistency overcomes this to some extent, many applications need more than a probable guarantee.

3. Contributions
The paper has explained and performed an extensive study on both pessimistic and probabilistic crash consistency, which has helped the authors make design decisions for their implementation of an optimistic crash consistent FS called OptFS. They explain features such as pessimistic journaling using checksums to reduce the need for ordering, the huge performance degradation caused by flushing, and how workloads, queue size and journal layout affect the probabilistic consistency (window of vulnerability). Then the paper presents its idea of optimistic crash consistency. OptFS's main idea is separating ordering from durability, and the authors have decided to sacrifice freshness for consistency and performance. The journaling protocol writes, in order, the data blocks, the metadata journal, the journal commit, and the checkpoint. They handle reordering in two ways: some re-orderings can be allowed and then detected later, while others must be avoided. OptFS uses asynchronous durability notifications to avoid reordering by delaying writes (e.g., ensuring M is written after D, Jm, Jc), where the disk notifies the client twice (when it has received the write and when the write has been persisted), and it uses checksums to detect inconsistency caused after a crash due to reordering. They use six optimistic techniques: in-order journal recovery and release, checksumming, background write after notification, reuse after notification and selective data journaling. The idea of using checksums for both data and metadata, doing detection instead of prevention, is interesting, and the idea of providing a delayed/old but durable state after failures is also intriguing (many applications are fine with this, and I feel this is one of the main contributions of this paper), but I would like to know more about what work is going on currently to provide such performance-consistency guarantees for real-time applications. The paper was able to explain the working of each of these techniques with examples, but I did not understand when a system decides to do selective journaling. They finally talk about their two sync operations: osync (a consistent version of stale data) and dsync (a consistent version of fresh data).
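Regarding the checksums over data mentioned above, one way to realize them (consistent with what these reviews describe, though the exact on-disk placement is not spelled out here) is to checksum each data block of a transaction individually, so recovery can tell exactly which data blocks made it to disk rather than only whether all of them did. A minimal sketch using zlib's crc32, with invented names:

#include <stdint.h>
#include <zlib.h>

#define BLOCK_SIZE 4096   /* assumed block size */

/* Compute one CRC32 per data block; the results would be stored alongside
 * the journal's description of the transaction and re-verified at recovery. */
void per_block_checksums(const uint8_t blocks[][BLOCK_SIZE], int n_blocks,
                         uint32_t out[])
{
    for (int i = 0; i < n_blocks; i++)
        out[i] = (uint32_t)crc32(crc32(0L, Z_NULL, 0),
                                 blocks[i], BLOCK_SIZE);
}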

4. Evaluation
They have compared their design of OptFS with the design of ext4 with and without flushes on different applications and operations. In most of the scenarios it has performed considerably better than both configurations of ext4. They compare them for different types of operations like sequential writes, sequential overwrites, random writes and file creation, as well as different applications like Fileserver, MySQL, etc. OptFS performed better in all applications except MySQL and for all operations except sequential overwrite; sequential overwrites were the reason it performed worse than ext4 without flushes for the MySQL application. They also show that it adds about a 10% CPU overhead and can provide good performance with a journal size of 512 MB. This is a result which I feel they could have elaborated further, as they have not explained why the rate of increase in performance decreases (plateaus) with increasing journal size. Finally they present two case studies and show that OptFS does not leave the system in an inconsistent state; it leaves most of them in an older but consistent state, and performs more operations per second than both ext4 configurations. Overall they were able to evaluate and justify the results (as well as the superiority of OptFS's design) for all the experiments except the one on journal size.

5. Confusion
I did not understand what they were trying to say in section 5.2 on handling data blocks, i.e. why you need individual block checksums? Also how is the decision on when to do selective data journaling made?

1. Summary
The paper presents a new approach in journaling file systems called optimistic crash consistency, which looks to improve the performance of existing journaling techniques while still ensuring consistency. The paper discusses its implementation, OptFS, and its primitives that help achieve the necessary consistency goals.

2. Problems
Disk technologies like write buffering do not guarantee ordering of writes, and hence the application must ensure ordering through calls like fsync, which flush the contents of the write buffer to the disk. The paper calls this type of approach to ensuring consistency the pessimistic approach. It also results in degraded performance, as the disk is busy writing the flushed blocks and reads are delayed. Some journaling file systems disable flushing to improve performance, at the risk of losing data depending on various factors like workload characteristics, queue size and journal layout. The authors state that the root problem is conflating durability and ordering, and propose a new technique that decouples these two.

3. Contribution
The major contribution of this paper is the optimistic crash consistency approach, which assumes that crashes are rare and achieves consistency by employing techniques such as checksums, selective data journaling, etc.
The proposed method provides asynchronous durability notifications, wherein the disk sends two notifications to the file system: one for the reception of a write request and one when the write has been persisted to the disk. This gives the file system more flexibility to schedule the writes efficiently. In the event of a crash, the file system walks through the committed journal to ensure consistency; during this operation, it uses the checksum to check whether the updated metadata journal is pointing to a new or an old data block. When a block is deleted and subsequently reused by another file, optimistic journaling employs reuse after notification to avoid the security risk, by making sure the metadata update of the earlier transaction is durable before the data block is reused. That approach would be inefficient when data blocks must be overwritten in place, so optimistic journaling adopts the less optimized technique of journaling the data along with the metadata (selective data journaling), which marginally decreases performance as two writes are now involved instead of one.

The authors also present the implementation of optimistic journaling, called OptFS, in Linux 3.2, done by modifying the existing ext4 FS along with the journaling and virtual memory subsystem layers. They also discuss the implementation of various optimistic techniques like in-order journal recovery, in-order journal release, etc.

4. Evaluation
The authors have evaluated the methodology for consistency, performance and implementation overheads. The evaluation presents crash simulations and shows that the proposed mechanism ensures 100% consistency after the crashes. Optimistic consistency performs significantly better (3x) for random writes, as it converts them into sequential journal writes. But it suffers for sequential overwrites, which require in-place data block updates (two writes in optimistic journaling). The paper also shows performance on macro benchmarks, where OptFS performs significantly better than ext4 (with flushes).

The paper also evaluates the overheads involved in this technique. Since disk writes are delayed, more data is held in memory, increasing the memory footprint; checksum computation also increases CPU utilization. Journal size impacts OptFS performance, and the results are tabulated. A brief case study on two applications (gedit and SQLite) is also presented to give a clear picture of the performance and consistency improvements in OptFS. The authors have presented an extensive evaluation of the idea, though some more explanation of the overheads would have been helpful.

5. Doubts
Increase in CPU utilization seems to be really high for checksum computation and running background process. What else could potentially hurt CPU performance? Does the increased memory footprint hurt performance in anyway?

1. Summary
This paper presents OptFS, a journaling FileSystem that provides performance and consistency by decoupling write ordering from durability. This results in impressive performance gains over traditional crash consistent FSes, without loss of consistency, at the cost of durability becoming eventual, not immediate. The trade-off is that consistency is provided with prefix semantics.

2. Problem
High performance vs strong consistency trade-off (up to 10x performance gains possible).
No real middle ground. Many users choose performance (e.g. ext3).
Current CC solutions – file systems conflate ordering and durability.
Pessimistic in that they ensure the disk is always consistent using flushes, which is expensive.

3. Contributions
Insight : disabling flushing does not always lead to inconsistency - the probabilistic crash consistency model, studied in detail. This served as the foundation for the optimistic FS approach.
Key Innovation : Separation of ordering and durability.
Trade-off - Gain both performance and consistency, but durability is eventual, not immediate.
Pessimistic Journaling : {D|Jm -> Jc -> M}, or, {D -> Jm|Jc -> M}. D - Data block, Jm - metadata journal updates, Jc - journal commit, M - in-place Metadata update, '->' - flush.
OptFS Eliminates flushes in the common case – leads to re-ordering of blocks being written to disk.
Mechanisms : Asynchronous Durability Notifications, and an extension of Transactional checksums.
ADNs – signals provided by disk drives once a write is persisted. ADNs help handle re-ordering: blocks can be destaged to disk in any order, at any time.
Invariant maintained : {D/Jm/Jc} and {M} are never dirty in disk cache together
Checksums are computed over D and Jm and stored in Jc, hence removing the need for the first flush.
Second flush, between Jc and M, is removed by delaying M write until ADN arrives for D, Jm, Jc.
Other optimistic Techniques:
In-order journal recovery (sequential recovery of transactions), and release (do not free a journal transaction until M is persisted; see the sketch after this list)
Reuse after notification (data blocks are re-used only after the checkpoint is persisted)
Selective data journaling (both data and metadata are journaled in cases where data-blocks are frequently overwritten and file needs to maintain its layout).
New FS Primitives :
osync() - provides ordering only, and hence high performance
dsync () - durability
Trade-off : data writes delayed, increasing perf, at cost of additional memory load
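The in-order release rule referenced earlier in this list can be sketched as follows (invented names): the journal tail only advances past a transaction once the durability notifications for that transaction's checkpoint writes have arrived, and only in order, so a crash can always re-checkpoint anything whose in-place copies might not yet be durable.

#include <stdint.h>

struct journal {
    uint64_t tail_txn;   /* oldest transaction still held in the journal */
};

/* Invoked once all checkpoint-write ADNs for txn_id have been received. */
void release_journal_space(struct journal *j, uint64_t txn_id)
{
    if (txn_id == j->tail_txn)
        j->tail_txn++;   /* free this transaction's journal space, in order */
}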

4. Evaluation
Evaluated as middle ground between ext4 without flushes (without consistency) and ext4 with flushes and with consistency. OptFS also sometimes outperforms ext4 without flushes as ext4 flushes dirty blocks in small batches in the background, while OptFS does so in large batches periodically or on commit. Below *x speeds are relative to ext4 with consistency.
Micro-benchmarks – write throughput and bandwidth. (Ops / second)
Random writes – 3x faster, sequential overwrites – 0.5x.
Macro-benchmarks – FileSystem workloads – fileServer, WebServer, Varmail (7x), MySQL OLTP (10x)
Parameters – resource consumption (more cpu and memory load) and journal size (worse - selective data journaling)
Application case studies – gedit atomic update (40x) , temporary logging in sqlite (10x)
My Opinion :
Keywords - Interface change, optimize for common case, separation / decoupling.
Big point - Eventual durability -> consistency is provided with prefix semantics, which means that while ordering is maintained, not all of the transactions will have been persisted (stale data).
Note on Gains - since ADNs are not yet provided, they used the worst-case durability value as a timeout approximation for the ADN. Hence, if ADNs are actually implemented, gains are likely to be higher.
Negatives - CPU and memory load higher, not great for sequential overwrite heavy workloads.

5. Confusion
More details on FUA would be good.
Also would like some more detail on the checksum and its storage + verification process.
Effect of queue size on total i/o time?

Summary
The paper introduces optimistic crash consistency, a new approach to crash consistency in journaling file systems. The goal of optimistic crash consistency, as realized in an optimistic journaling system, is to commit transactions to persistent storage in a manner that maintains consistency to the same extent as pessimistic journaling, but with nearly the same performance as probabilistic consistency.
The Problem
The introduction of write buffering gave rise to a lot of consistency and durability issues springing from out-of-order write completion. To ensure write ordering, modern drives are equipped with cache flush operations. However, cache flushing is plagued with multiple problems. Firstly, it is expensive and makes I/O scheduling less efficient. In addition, during a large cache flush, disk reads may exhibit extremely long latencies as they wait for pending writes to complete. Finally, flushing conflates ordering and durability. In a nutshell, the classic approach of flushing is pessimistic, resulting in high performance overheads. This has led some systems to disable flushing, apparently sacrificing correctness for performance. Hence the authors propose a new file system, the optimistic file system (OptFS), that takes the middle ground and provides both high performance and strong consistency.
Contributions
1. The paper carefully studies the problem of probabilistic crash consistency, showing quantitatively, for the first time, which exact factors affect whether a crash will leave the file system inconsistent. Different workloads are examined to get an idea of the types of applications for which the probabilistic approach is sufficient.
2.The paper optimizes for the common case which is no crash. In the rare event that a crash does occur, optimistic crash consistency either avoids inconsistency by design or ensures that enough information is present on the disk to detect and discard improper updates during recovery.
3. OptFS performs selective data journaling only for data blocks that are repeatedly overwritten within the same file, where the file needs to maintain its original layout on disk. Thus it can perform in-place updates for suitable workloads without incurring the performance overhead of writing the data block twice in the common case.
4. Optimistic crash consistency handles reordering in two different ways: some reorderings are detected after a crash via checksums over metadata and data, while others are prevented from occurring by delayed writes.
5. It uses an asynchronous durability notification signal issued when a block is made durable. ADNs increase disk freedom: blocks can now be destaged in any order and at any time. OptFS uses ADNs to control which blocks are dirty in the disk cache at the same time; re-ordering can only happen among these blocks. Waiting for the asynchronous durability notification before a background checkpoint is fundamentally more powerful than the pessimistic approach of sending an ordering command to the disk (i.e., a cache flush).
6.Central to the performance benefits of OptFS is the separation of ordering and durability. By allowing applications to order writes without incurring a disk flush, and request durability when needed, OptFS enables application-level consistency at high performance. OptFS introduces two new file-system primitives: osync(), which ensures ordering between writes but only eventual durability, and dsync(), which ensures immediate durability as well as ordering.
Evaluations
The evaluation section is very comprehensive and thorough. Firstly, the authors quantify the probability of inconsistency, Pinc, and list the factors affecting it, namely workload type, queue size and journal layout. Six different workloads are used to understand the impact of applications on Pinc. Next it is shown that OptFS is consistent after 400 random crashes. The performance of OptFS is measured by running micro and macro benchmarks. OptFS significantly outperforms ordered mode with flushes on most workloads, providing the same level of consistency at considerably lower cost. On many workloads, OptFS performs as well as ordered mode without flushes, which offers no consistency guarantees. One thing I like about the paper is that the authors do not shy away from pointing out the performance shortcomings of their implementation and adequately explain their observations. For example, OptFS consumes 10% more CPU than ext4 ordered mode without flushes; the authors argue that this overhead is due to CRC32 data checksumming. The poor performance of OptFS compared to ext4 without flushes under small journal sizes is attributed to selective data journaling, as is OptFS being unsuitable for workloads with sequential overwrites. Additionally, the authors show that meaningful application-level consistency can be built on top of OptFS, as demonstrated by studying gedit and SQLite on OptFS. OptFS is 40x and 10x faster per operation than ext4 with flushes for gedit and SQLite respectively. One small complaint with the paper is that I would have liked a better explanation as to why the durability timeout was set to 30 seconds.
Confusions
My belief is that inconsistency issues (like erased block being accessed by durable metadata from before) should exist even after the window of vulnerability . What am I missing?

1. Summary
The paper discusses optimistic crash consistency, a new approach to crash consistency for journaling file systems. The approach achieves the performance of probabilistic crash consistency while always maintaining a crash-consistent image. The authors develop a prototype file system, OptFS, to demonstrate the viability of this solution.
2. Problem
The problem stems from journalling file systems needing to impose relative ordering to ensure a crash consistent image. In most file systems and block devices the concepts of ordering and durability are conflated leading to the file system using expensive cache flush operations to impose an order on operations. On the flip side, not doing these flushes would lead to massive performance gains but a probabilistic consistency model. This model could crash in an inconsistent state with a probability depending on workload characteristics. No approach offered the best of both worlds.
3. Contribution
The paper’s primary contribution is the separation of ordering from durability. This allows file system changes to be consistent (due to relative ordering) without incurring the cost of a cache flush at every step. The paper also documents the steps required for recovering after a crash and for releasing journal blocks after checkpointing the metadata. The paper smartly uses checksums to reduce the ordering requirements between the data, journal metadata and journal commit blocks. The paper also takes care to ensure that data blocks are only reused when no durable metadata points to them, so that after a crash, stale metadata does not get access to other files' blocks. The paper introduces selective data journaling to handle in-place overwrites by writing the data out to the journal first, ensuring that multiple overwrites to the same block do not lead to a final inconsistent image. The paper still provides the legacy cache-flush behavior by introducing two system calls: osync(), to ensure relative ordering with only eventual durability, and dsync(), to ensure immediate durability as well as relative ordering.
4. Evaluation
The paper first establishes a baseline by studying both probabilistic and pessimistic crash consistency; the authors study both the performance benefits of disabling the flush command and the window of inconsistency introduced when flushing is disabled, as a function of the workload. The authors then establish the performance of OptFS by running various micro benchmarks to capture sequential/random read/write performance; this shows the fundamental strengths and weaknesses of the solution. The paper then runs benchmarks with different I/O characteristics to capture real-world performance and resource consumption. The paper also includes real-world case studies to show how an application could use the new system calls instead of fsync() to derive the performance benefits of OptFS while still ensuring a crash-consistent image. An important point to note here is that while OptFS always leads to a crash-consistent image, it reboots to the old state more often than either ext4 variant (with or without flushes). The authors quote SQLite documentation as saying that most people only care about the database being consistent. However, many applications may need the database to reflect the updated state, and in that case even ext4 without flushing performs better. Overall the evaluation does not leave many questions unanswered. However, the authors rely on a durability timeout to emulate an asynchronous durability notification; they could have developed a solution based on FUA, as it would demonstrate a viable product usable without having to wait for disks with the proposed notification to be developed.
5. Confusion
I am confused about how absolute ordering is maintained between metadata updates of different transactions. The paper discussed the absolute order between the journal commits but not the journal checkpoints, hence M1 could be overwritten by M0.
Secondly, I am not sure where the block-wise checksums are stored in the journal; there may be too many of them for the commit block. The paper mentions descriptor blocks but these aren't explained.

Summary:
The paper describes the design and implementation of optimistic crash consistency (OptFS) in journaling file systems, with a central idea of avoiding expensive disk flushes to ensure ordering while still providing certainty of consistency over crashes. The basic idea of the paper is to decouple the ordering of writes from their durability.

Problem:
Traditional file systems ensure ordering of writes through expensive disk flushes. These flushes also enforce ordering of blocks that were in the write queue but that don't require any particular ordering. Some file systems work around this by using transactional checksums to verify the veracity of the metadata written in the journal. The use of flushes to ensure ordering proved to be a performance bottleneck. On the other hand, disabling flushes would lead to potential inconsistencies depending on workload, timing of the crash, etc. The paper proposes an approach that provides both guaranteed consistency and performance.

Contributions:
1) The paper proposes a disk interface with a minor addition. The paper calls for an "Asynchronous Durability Notification" that informs the file system that a particular request has been persisted to disk. Traditional disks expose an interface called Forced Unit Access that provides a synchronous durability notification. OptFS uses this notification to enforce a weak ordering between journal and checkpoint metadata disk writes.
2) The paper also describes some properties which need to be maintained by any file system implementing optimistic crash consistency. First, it must order the metadata checkpointing based on the transaction order. Second, it must ensure a journal commit write before the corresponding metadata write.
3) The paper also discusses some techniques to implement the idea. Two of these relate to the in-order recovery and release of journal transactions. Checksumming over data and metadata allows for out-of-order commits of data and the journal log. OptFS also allows background metadata checkpointing at a later stage to give flexibility to the disk scheduler, handles scenarios where data blocks are reused, and performs selective data journaling to ensure contiguity in data block placement at the cost of double data writes.
4) OptFS exposes two file system primitives osync() and dsync() that ensure ordered but only eventual durability and immediate durability respectively.

Evaluation:
The authors demonstrate the correctness of OptFS over crashes. OptFS seems to restore a consistent state at every crashpoint. The authors have also shown the performance improvement over ordered journaling file systems with both micro and macro benchmarks. The paper also shows how journal size affects the number of transactions committed per second for OptFS. Case studies for two applications have been done to analyze the impact of workload characteristics on the consistency guarantees. The paper claims that the cases requiring selective data journaling are infrequent; the storage overhead of writing data twice for large writes could have been quantified to show the worst-case impact. Overall, OptFS has been sufficiently evaluated with respect to both pessimistic and probabilistic crash consistency.

Confusion:
Are file systems like LFS and OptFS orthogonal? They seem to share certain common goals. How are they different?

1. Summary
The authors introduce the Optimistic File System, which attempts to achieve a middle ground between performance and consistency by decoupling ordering and durability: fsync() is split into two new primitives, osync() for ordering with high performance and only eventual durability, and dsync() for immediate durability in addition to ordering.

2. Problem
The crash consistency problem stems from the fact that each file-system operation updates multiple on-disk data structures. The system may crash in the middle of these updates, leaving the file system partially updated. The journaling protocol comprises the data write (D), logging of metadata (Jm), the commit (Jc), and eventually checkpointing (M). It depends heavily on ordered writes, and using cache flushes purely for ordering is an expensive operation that leads to low performance in such pessimistic crash-consistent systems. Probabilistic crash consistency is obtained when flushing is disabled; Jm, Jc and other blocks may then be reordered, leading to windows of inconsistency. Such systems are not desirable when strong consistency needs to be guaranteed along with high performance.

3. Contribution
New primitives, osync() and dsync(), are introduced that provide ordering among writes at high performance using checksums and delayed writes, and that handle the reordering which results from removing flushes. Via a slight modification of the disk interface, an asynchronous durability notification signal is introduced, sent when a block is made durable; only after such notifications are received for D, Jm and Jc is M written to disk. In-order journal recovery and release, reuse after notification, and selective data journaling are used for cases where blocks are deleted in one transaction and then reused in another before the journal entry is durable, and for in-place block changes. A transaction is discarded if the stored and calculated checksums of its data and metadata don't match.

4. Evaluation
The authors run multiple benchmarks and real-world applications that stress the system to test its reliability, performance and resource consumption. OptFS performs 4-10 times better than ext4 with flushes. It consistently outperforms ext4 both with and without flushes in terms of time spent per operation, and since it is conservative about making updates durable, it more often recovers to the old consistent state instead of the new one in the experiments with gedit and SQLite workloads. Since the interrupt for durable writes was simulated, in my opinion it would have been more insightful to understand how much added latency the roughly 2X increase in interrupts would introduce if durability actually generated disk interrupts. OptFS has been thoroughly tested, keeping in mind how multiple factors like workload, disk queue size and journal layout affect the probabilistic guarantees, and quantifying probabilistic consistency appropriately.

5. Question
I want to understand, from Figure 5, why the total I/O time changes only slightly as the distance between data and journal increases, while the inconsistency probability drops drastically.

1. Summary
The authors introduce a new way of handling crash consistency in journaling file systems by decoupling ordering from durability. In the case of a crash, it avoids inconsistency by design or ensures that enough information is present to detect and discard improper updates during recovery. Techniques like in-order journal recovery and release, checksums, background writes after notification, reuse after notification and selective data journaling are employed.
2. Problem
Write buffering causes disk writes to complete out of order. Cache flushes make I/O scheduling less efficient and unnecessarily force all previous writes to disk, which may cause long latencies for disk reads. Poor performance is exacerbated by conflating ordering with durability. The probabilistic approach is insufficient to realize higher-level application consistency, for example under write-heavy workloads where an early commit before data is likely.
3. Contributions
It achieves both high performance and strong crash consistency by decoupling ordering and durability of disk writes. Such decoupling maintains strong crash consistency while trading file-system freshness for increased performance. It requires a slight modification to the disk interface, adding new signals called asynchronous durability notifications. A transactional checksum is maintained to detect data/metadata inconsistency. Block reuse is delayed to avoid incorrect dangling pointers. Block overwrites are handled with selective data journaling. These techniques lead to both high performance and deterministic consistency. It doesn't require changes to file-system structures outside the journal. Checksums remove the need to order writes by generalizing metadata transactional checksums to include data blocks. Transaction checkpointing is delayed until durability is confirmed through asynchronous durability notifications. It doesn't constrain disk writes and gives the disk flexibility to make the best caching and scheduling decisions. Ordered journaling ensures metadata serializability and the validity of its pointers. Two new calls are provided: osync, to guarantee ordering between writes with performance close to that of disabled cache flushing, and dsync, to guarantee durability of pending writes.
4. Evaluation
Disabling flushes causes a vast performance improvement compared to the various historical optimizations over pessimistic crash consistency. Since checkpointing occurs in the background, there is no impact on performance from waiting for a set of writes. Due to the prefix semantics of osync, the user might see stale data, but it is guaranteed to be consistent. Since the implementation uses durability timeouts in the absence of a proper disk interface, there can be a delay in receiving asynchronous durability notifications. Recovery may also take longer, as it needs to read data blocks off non-contiguous disk locations for checksum verification. Performance can drop under memory pressure, when disk flushes are issued to ensure durability of checkpoint blocks while freeing memory buffers. Sequential overwrites cause a drop in bandwidth, while random writes are faster due to selective data journaling. Because it delays writing dirty blocks, it performs well compared to ext4 even with flushes disabled, which issues writes in small batches. This delay, on the other hand, increases memory consumption. Computational overhead is imposed by data checksumming and the background thread which frees data blocks. Atomic update of a file is evaluated with a document-editing application, while ordered transactions and eventual durability are demonstrated with the SQLite database management system. OptFS reduces the number of durability events by enforcing a weaker write ordering, which also avoids complex dependency tracking inside the OS.
5. Confusion
How does disk scheduler queue size affect inconsistency?

1. Summary
The authors have come up with a new approach to crash consistency in journaling file systems by decoupling the ordering of writes from their durability. To achieve this, they introduce two new system calls, osync() and dsync(). Through extensive experiments they show that their design delivers high performance and ensures crash consistency for applications.

2. Problem
Existing file systems ensure ordering by cache flushing, which is an expensive operation as it gives the scheduler fewer options to pick from. It also mixes up the two distinct issues of ordering and durability. The reason for such a policy is the pessimistic assumption that the disk can crash at any time, which leads to heavyweight mechanisms to prevent inconsistency, thereby hurting performance.

3. Contribution
The main contribution of the authors was identifying the possibility of performance improvement by performing several experiments. They showed how checksums and avoiding synchronous flushes can improve throughput. They also studied various scenarios where the absence of flushes does not lead to an inconsistent state (read operations), and how ordering writes differently can lead to a much more consistent state. They also did a neat job of quantifying the probability of inconsistency by experimenting with how it varies with workload, queue size and the layout of the journal.
The authors then introduced the notion of an asynchronous durability notification, which decouples acknowledgment of a request from the actual durable write. This was made workable by checksum matching during recovery. Their optimistic consistency introduces several properties which allow parallelism among data blocks, journal metadata and the journal commit; by ensuring that these still meet the consistency semantics, their implementation can also perform faster. The optimistic techniques are: in-order journal recovery, which replays transactions only up to the last valid Tx:i and discards subsequent ones (a sketch follows below); in-order journal release, which does not discard the journal metadata and checkpoint until the metadata (M) correctly points to durable data; background writes after notification, so that journal writes do not have to wait for previous checkpoints because the M writes happen in the background; reuse after notification, which handles the scenario where two different M could point to the same data block by using only 'durably free' data blocks, freeing a data block only after the M that referenced it is durably deleted; and selective data journaling, which handles overwrites by writing the data blocks into the journal instead of only in place, at the cost of doubling the amount of data written. The authors then implemented this design as OptFS. Pathological scenarios arose, such as overwrites of multiple data blocks sharing just one checksum, which suggested using a per-data-block checksum.
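A minimal sketch of the in-order recovery scan referenced above, using assumed structures (compute_csum() and replay() are hypothetical helpers, not the OptFS code): transactions are examined in journal order, and replay stops at the first one whose stored commit checksum does not match, discarding it and everything after it.

#include <stdint.h>

struct jtxn {
    uint32_t stored_csum;   /* checksum recorded in the commit block Jc */
    /* ... references to the transaction's data and metadata blocks ... */
};

uint32_t compute_csum(const struct jtxn *t);  /* assumed: recompute over D and Jm */
void replay(const struct jtxn *t);            /* assumed: checkpoint this transaction */

void recover_journal(struct jtxn *txns, int ntxns)
{
    for (int i = 0; i < ntxns; i++) {
        if (compute_csum(&txns[i]) != txns[i].stored_csum)
            break;              /* Tx:i is incomplete: drop it and all later ones */
        replay(&txns[i]);       /* Tx:i and all earlier transactions are valid    */
    }
}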

4. Evaluation
The evaluation presented in the paper is quite thorough. First, they show that OptFS is reliable by introducing crash scenarios that could lead to an inconsistent state and verifying that OptFS brings the FS back to a consistent state. Then they compare OptFS with two modes of ext4: with flushes (the default) and without flushes (which can introduce inconsistency). They show that OptFS is slower for sequential overwrites since it needs to write such data twice. Random writes are faster because of batching. The macro-benchmark experiments provide a fair comparison against real-life applications and show that OptFS is not so great for workloads with sequential overwrites. They then show that OptFS consumes more CPU (due to CRC32 checksumming) and memory (due to data blocks being freed in the background). Lastly, they compare their implementation across two applications, gedit and SQLite, and show that it results in zero inconsistent states.

5. Confusion
I am still not clear whether consistency is ensured when the disk doesn't crash but checkpointing happens out of order. Also, I would like to quickly see the scenario where metadata commits before the actual data in the ext4 file system; how can an inode point to the wrong data with the 'right' set of mount options?

1. Summary
This paper describes a new approach to ensure crash consistency in journaling file systems, known as optimistic crash consistency. The purpose of the proposed mechanism is to achieve consistency to the same extent as pessimistic journaling while achieving the same performance as probabilistic consistency. The authors give a detailed motivation for the need for such an approach and the shortcomings of the existing approaches. The authors go on to give details of their proposed solution and back it up with a fairly convincing evaluation. Initial results show that the proposed solution improves the performance of many workloads and also has the ability to recover to a consistent state after a crash.
2. Problem
With the introduction of write buffering, disk writes could be completed out of order. This complicates techniques for recovering from system crashes. Prior to this work, write ordering was ensured either by pessimistic consistency or by probabilistic consistency. The former has a number of shortcomings: flushes make I/O scheduling less efficient, cause high-latency disk reads during long flushes, and conflate ordering and durability (assuming crashes occur commonly). On the other hand, though the latter has better performance than the former approach, it may not be useful for applications that require certainty in crash recovery. The authors try to solve the aforementioned problem.
3. Contribution
The basis of the proposed solution is the separation of ordering and durability, which ensures that consistency is achieved without sacrificing performance. In my opinion, the initial part of the paper, wherein the authors give details regarding pessimistic crash consistency and probabilistic crash consistency (along with the factors that affect the probability of inconsistency), clearly makes the reader understand the need for the proposed solution. The end goal of the proposed solution is to eliminate the intermediate flushes present between D, Jm, Jc and M. The authors propose a number of techniques to do so. Firstly, an in-order journal recovery is adopted, by means of which recovery stops at the first transaction that is not complete on disk (sacrificing freshness). Secondly, the authors also adopt in-order journal release, which ensures that the system waits to release a transaction until it receives notification from the disk that the checkpoint writes are durable. Thirdly, the authors get rid of the flushes between D, Jm and Jc via the use of checksums (data as well as metadata checksumming). Fourthly, the authors also propose the use of asynchronous durability notifications (which require support from the disk interface) to ensure that the checkpoint of M occurs after D, Jm and Jc. Additionally, the authors handle consistency when blocks are reused by ensuring that blocks can be reused only after notification (which ensures that the updated metadata journal entry is durable before new data can be written). Lastly, the authors also introduce selective data journaling to take care of the situation where block data is overwritten (without losing out on locality).
4. Evaluation
The authors evaluate the performance of optimistic crash consistency by implementing the proposed solution in a prototype known as OptFS. Firstly, the authors verify OptFS's consistency guarantee by simulating crashes under two synthetic workloads. The results obtained show that OptFS recovers correctly in all 400 different crash scenarios. Secondly, the authors analyze the performance of OptFS using a variety of micro and macro-benchmarks. The results show that OptFS improves performance considerably in many cases (random writes, createfiles) due to the fact that OptFS delays writing dirty blocks (and instead uses batching) and reduces the number of flushes. The authors also point out that the proposed solution is not suited for workloads (MySQL) that have many sequential overwrites (due to the overhead of writing data twice with selective data journaling). The authors go on to compare resource consumption as well as memory overhead of OptFS and attribute the overheads to checksumming, background thread activity and delayed checkpointing. The authors also evaluate the performance of OptFS with respect to the journal size, and show how it can be used to provide application-level consistency using gedit and SQLite as examples. Though the authors have done an excellent job on the evaluation of the proposed solution, there are a number of aspects that could have been examined. Firstly, the performance degradation of OptFS due to selective data journaling could have been evaluated in isolation. Also, it would have been ideal to give more details regarding the other factors that affect the probability of inconsistency (merely stated as not significant).
5. Confusion
Last class we read that GFS was designed for scenarios where crashes were a norm. Through this paper, we read that the proposed solution assumes that crashes aren’t a norm. What are your thoughts on using the proposed solution in datacenters? Would it be useless as failures are a norm in datacenters?

1. Summary: This paper presents a new FS, called OptFS, which modifies the ext4 file system to provide a crash consistency model with strong consistency guarantees at much better performance, by removing expensive cache flushes and using transactional checksums to allow out-of-order writes. The authors call their model Optimistic Crash Consistency.
2. Problem: Write buffering was used in disks to allow the disk scheduler to re-order writes to reduce the seek and rotational overheads and achieve better performance. But this aggravated the problem of keeping the disk state consistent across crashes. With inexpensive disks, crashes are much bigger of a problem now! To guarantee consistency, existing journaling FSs used techniques like cache flushing, which significantly reduces performance (the pessimistic approach). This degrades average-case performance, and all for the unlikely scenario of a crash. Other FSs ignored the consistency problems, resulting in probabilistic crash consistency, though at better performance. The authors propose a solution to get the best of both worlds.
3. Contribution: Optimizing for the common case seems to be a central theme in systems, and the authors have done exactly that. They realize that crashes are infrequent, so degrading the average performance for crash consistency doesn't make sense. Thus, they don't use cache flushing. One of the major contributions is the addition of powerful interfaces: the osync() and dsync() calls, effectively decoupling ordering requirements from durability. Since the FS no longer uses cache flushes to enforce ordering of journal data, they use, and improve on, transactional checksumming techniques. They calculate the checksum over data as well, to provide even more flexibility in reordering the D, Jm and Jc components. As a result, data blocks, journal metadata, and journal commits within a transaction can be reordered. Since the file system still needs to know when writes become durable, they also introduce asynchronous durability notifications. To maintain the data locality of files, they use selective data journaling for in-place overwrites. I particularly liked how each design decision of OptFS can be traced to a basic idea in the systems community. This proves novel designs need not be over-complex!
4. Evaluation: To support their point that inconsistency from crashes is infrequent, the authors note that an ext4 FS with probabilistic crash consistency leads to very few observable inconsistencies. They make a thorough analysis of why, and under what conditions, these inconsistencies occur. They show that increasing the disk distance between data and journal metadata decreases the probability of inconsistency, since data will be written first, although at a higher performance penalty. Then they show the performance and recoverability of their implementation. They show that their FS can recover from crashes by simulating 400 crash scenarios, using disk images up to that point, and recovering successfully every time! To measure performance, they run micro- and macro-benchmarks and compare against the probabilistic and pessimistic models. The good thing is that the base FS in all cases is ext4, so the performance differences are only because of the modifications to the crash consistency model. Because of selective data journaling, sequential overwrites perform poorly in OptFS, but in all other cases it is as good as the probabilistic consistency model. This shows that OptFS is nearly as fast as the probabilistic crash consistency model, with full consistency guarantees - the best of both worlds. I also liked how the authors ported real-world applications and compared their performance on the new FS.
5. Confusion: How do you prove that a system is reliable for all crash scenarios? How do systems like ext3 with no cache flushing eventually recover from inconsistencies?

1. Summary
This paper presents optimistic crash consistency, which is a combination of several techniques with the overall goal of correct crash recovery and high performance. A file system implementation is done inside Linux 3.2 and several case studies are evaluated.

2. Problem
Modern storage devices achieve write ordering by flushing the cache in a "pessimistic" fashion. While this protects against crashes, overall performance is greatly degraded because cache flushes are very slow. Optimistic crash consistency builds on the study of probabilistic crash consistency and introduces several techniques to achieve high performance and deterministic consistency.

3. Contributions
Optimistic crash consistency builds on probabilistic crash consistency, which measures the amount of time spent in "windows of vulnerability" versus the total run time. The probability is highly workload-dependent with high variance, but there are several heavy contributors, an early commit before data being the main one. Because crashes are possible, journaling is traditionally pessimistic, but an optimistic approach can help patch the holes in probabilistic crash consistency.
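Roughly, the probability-of-inconsistency metric discussed here can be paraphrased as the fraction of a workload's run time spent inside such windows of vulnerability:

P_{inc} = \frac{t_{\mathrm{vulnerable}}}{t_{\mathrm{total}}}

where t_vulnerable is the cumulative time during which a crash would leave the file system inconsistent and t_total is the total run time of the workload.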
The basic idea is to separate durability from ordering. There are two cases under consideration: an “ordering” sync where order should be guaranteed between writes, and a “durability” sync where writes are forced to stable storage without ordering being the main concern. Two system calls, osync() and dsync() are provided, and these calls provide different guarantees that applications can rely on.
The heart of the paper is a number of techniques designed to provide the consistency semantics of ordered journaling. Overall, optimistic crash consistency is facilitated by (1) using checksums to eliminate the need for ordering writes, and (2) using "asynchronous durability notifications" to delay checkpoints for transactions until they have been committed durably. To be specific, the techniques are:
(1) In-order journal recovery: recovering transactions in the same order that they were written to the journal. Upon recovery, the journal can simply discard any write operations that occur outside the desired ordering.
(2) In-order journal release: releasing journal blocks only after the corresponding checkpoint writes have been acknowledged as durable by the disk. This ensures that journal transactions aren't freed or overwritten until the checkpointed data is confirmed to be durable.
(3) Checksums: computed over data and metadata blocks, and used to detect data corruption and lost writes.
(4) Background write after notification: perform periodic background checkpoint writes using the VM subsystem.
(5) Reuse after notification: when transactions are durably committed, finished data blocks can be freed or flushed as memory requires. This is used to ensure that durable metadata from earlier transactions doesn't point to data that is changed in later transactions.
(6) Selective data journaling: use block allocation information to determine whether data blocks need to be journaled like metadata. This provides data-journaling consistency semantics.

4. Evaluation
Two key metrics are evaluated: reliability and performance. Reliability is tested via a crash-robustness framework to see if the file system can return to a consistent state after a crash; 400 such scenarios were run. Performance was measured via micro and macro benchmarks with the main point of comparison being ext4. The general summary is that OptFS generally outperforms ext4, but it may not be suited for workloads which consist heavily of sequential overwrites.
A number of case studies are also described, namely gedit atomic updates and temporary SQLite logging. For the former, OptFS performs similarly to ext4 without flushes (40x faster). For the latter, it has the advantage over ext4 in that it always results in a consistent state and is 10x faster than ext4 with flushes.

5. Confusion
What exactly does “durably free” mean?

Summary
The paper presents OptFS - a journaling file system that provides both performance and consistency by decoupling ordering and durability.

Problem

A single filesystem operation updates multiple on-disk data structures, and if the system crashes in the middle of the updates, it might leave the filesystem in an inconsistent state. This happens because the disk can complete writes out of order. Crash consistency solutions employ cache flushing to ensure write ordering. Thus, cache flushing conflates ordering of writes to disk with durability, making ordering expensive and forcing users to choose between high performance and strong consistency. Such an approach is pessimistic: it assumes that a crash will occur and then goes to great lengths to prevent an inconsistent state using flushes. In order to gain performance, users disable flushes and rely on probabilistic crash consistency, where there is a possibility of ending up in an inconsistent state. However, such an approach also does not work well for workloads involving a high number of random write I/Os. So, can one design a filesystem that provides both high performance and strong consistency?

Contribution

The main contribution of this paper is decoupling ordering and durability by trading the freshness of data for performance while maintaining strong consistency guarantees. OptFS eliminates flushes in the common case, and blocks may be re-ordered without flushes. However, some re-orderings can be detected after a crash using checksumming and others can be prevented from occurring, thus ensuring strong consistency. OptFS relies on the Asynchronous Durability Notification (ADN), which signals the FS when a block is made durable. The ADN increases disk freedom and allows blocks to be de-staged at any time, in any order, while OptFS uses it to control which blocks are dirty at the same time in the disk cache. A checksum computed over data and metadata is checked during recovery to identify blocks lost during the crash by detecting mismatches. Metadata writes are delayed until ADNs are received for the previously issued data (D + Jm + Jc). Other optimistic techniques, like in-order journal recovery (stop journal recovery on encountering the first incomplete transaction on disk) and release (preserve journal data until checkpointing is complete), background checkpointing, reuse after notification and selective data journaling, also help in optimistic journaling. OptFS introduces two new file-system primitives - osync(), which ensures ordering between writes but only eventual durability, and dsync(), which ensures immediate durability as well as ordering.

Evaluation

Firstly, the authors evaluate the factors (workload type, queue size and journal layout) that affect the probability of inconsistency (Pinc). They use 6 workloads to understand how the workload can impact Pinc and show that around 90-100% of inconsistencies occur when the commit happens before the data write. They use varmail to study queue size and journal layout and find that in-disk SPTF scheduling improves performance by about 30% with a queue size of 8 or more, while the distance between the main filesystem structures and the journal makes a significant difference in Pinc. Secondly, the authors evaluate OptFS for reliability by running 400 different crash scenarios and find that in all scenarios the filesystem is consistent. Thirdly, they measure OptFS performance by running micro and macro benchmarks. They notice that OptFS significantly outperforms ordered mode with flushes on most workloads, providing the same level of consistency at considerably lower cost. On many workloads, OptFS performs as well as ordered mode without flushes, which offers no consistency guarantees, though OptFS is not suitable for workloads which consist mainly of sequential overwrites. OptFS consumes around 10% more CPU than ext4 ordered mode without flushes due to data checksumming and freeing data blocks in the background. The memory consumption is also higher as OptFS delays checkpointing and holds metadata in memory for a longer time. Fourthly, the authors measure the performance of OptFS as the journal space is consumed and observe that it is 5X better than ext4 ordered mode with flushes.
The authors also modify two existing real-world applications - gedit and SQLite - to understand the benefits of using OptFS by simulating crash-points. They observe that OptFS always results in a consistent state while ext4 without flushes does not. OptFS does end up in the old state more often than ext4 with flushes, as osync provides only eventual durability but ensures ordering. However, OptFS is 40X and 10X faster per operation than ext4 with flushes for gedit and SQLite respectively.

Confusion
Do SSDs also suffer from the same behavior? One could even handle ordering at the disk level. Won't that approach be better in terms of performance? The current approach wants the FS to have the big picture so that writes can be optimized for maximum efficiency. What's the trade-off between these two approaches?

1. Summary
Optimistic crash consistency aims at providing both high performance and strong consistency. OptFS is a journaling file system that decouples ordering and durability and uses that to maintain consistency while trading the freshness of the file system state for increased performance. The introduction of new fs primitives gives the flexibility to achieve such decoupling, increasing performance by up to 10x while preserving strong fs and application consistency.
2. Problem
Out-of-order writes, due to the presence of write buffers, make the system inconsistent after crashes. So, file systems require ordering guarantees when updating their data structures, which are achieved using explicit cache flush operations that persist the dirty writes immediately. But this is expensive and degrades performance: it also makes I/O scheduling less efficient, and a large flush increases read latencies. This approach is called pessimistic crash consistency. The probabilistic approach disables cache flushes, which increases performance but provides no consistency guarantee. Thus, we end up with a trade-off between performance and consistency.
3. Contributions
The factors that leave the file system inconsistent after a crash, as derived from the study of probabilistic crash consistency, are the workload (reads less, writes more), the order of persistence, the disk scheduler queue size, and the journal layout; notably, this probability never reaches 100%, so a crash does not always lead to an inconsistent fs. This leads the authors to find a middle ground between the pessimistic and probabilistic approaches in order to provide strong consistency and high performance. They do so by incorporating two main ideas: providing ordering without flushes using checksums and delayed writes (which achieves strong consistency), and decoupling ordering and durability (which achieves high performance). Reordering violations are either detected after a crash using checksums or prevented from occurring using delayed writes with the help of ADNs. The introduction of this asynchronous durability notification is a key contribution for any layered architecture; it enables the upper level (the fs in this case) to make smarter choices while keeping the requests to the lower level (the disk) minimal, thus improving overall performance. They provide smart techniques of journal recovery and release, reuse after notification and selective data journaling to achieve the optimistic properties. Finally, they decouple ordering and durability by introducing new primitives, osync and dsync, which improve performance by allowing eventual durability, though this in turn can leave the fs version stale. Overall, I liked the ideas and the techniques. The design carefully achieves ordering to guarantee the consistency of the fs while also achieving high performance, trading only the freshness of the file system state because of the eventual durability.
4. Evaluation
They implement it in a journaling file system called OptFS. Since the aim was to achieve strong consistency and high performance, they simulate many crash scenarios and find that OptFS provides the same consistency as ext4 with flushes. For performance, they emulate different workload scenarios and compare with ext4 ordered mode: they achieve 3x better performance for random writes, as selective data journaling converts them into sequential journal writes, 2x better for the createfiles benchmark, and the macro-benchmarks show 7-10x better performance, mainly due to delayed writes. Finally, they apply the primitives to real-world applications - gedit and SQLite - and evaluate the behavior when decoupling ordering and durability: OptFS achieves high performance with correct consistency but delays durability. The evaluation is pretty strong, with the additional evaluation of the pessimistic and probabilistic approaches, but they haven't shown experiments that correspond to the optimistic techniques of reuse after notification or recovery and release. Resource consumption gives a brief idea, but what would be the behavior in case of a flash flood of ADNs or any such extreme, workload-specific scenario?
5. Comments/Confusion
I did not understand the effect of journal layout on Pinc, or the FUA and tagged queuing techniques. Is selective data journaling based on the application workload, and if not, how is it switched on/off? Also, how does it improve the performance of random writes? Are these techniques only for journaling file systems or independent of the fs?

1. Summary
Optimistic crash consistency is a technique that provides crash consistency in journaling file systems without hurting performance through frequent flushes. It decouples ordering from durability and provides two new primitives, osync and dsync, to the application.

2. Problem
Journaling file systems need specific write ordering guarantees to maintain crash consistency, while storage devices usually provide high performance by buffering and reordering requests. The traditional way to achieve ordering is to issue a flush operation, which conflates ordering with durability and hurts the performance.

3. Contributions
To achieve ordered journaling consistency, the file system needs to maintain the following write ordering: the in-place write of data blocks (D) precedes the journal write of metadata (JM), which precedes the journal write of the commit block (JC); all of these operations, called a transaction (Tx:i), must happen before the next transaction (Tx:i+1) and before the in-place checkpointing of metadata (M).
The ordering constraint D->JM->JC is removed by using a checksum. A checksum of all these blocks is saved in the commit block and can be used to validate whether the writes to these three parts succeeded or not.
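As a rough illustration of the idea (crc32() is an assumed helper and the structures are simplified, so this is not the actual jbd2/OptFS code), the checksum covering the transaction's blocks is computed and placed inside the commit block, letting D, JM and JC be issued without an intervening flush and validated together at recovery:

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096
struct block { unsigned char data[BLOCK_SIZE]; };

uint32_t crc32(uint32_t seed, const void *buf, size_t len);  /* assumed helper */

/* Checksum over the transaction's data (D) and journaled metadata (JM) blocks;
 * the result is stored in the commit block JC and recomputed during recovery. */
uint32_t txn_checksum(const struct block *blks, int nblks)
{
    uint32_t c = ~0u;
    for (int i = 0; i < nblks; i++)
        c = crc32(c, blks[i].data, BLOCK_SIZE);
    return c;
}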
The ordering constraint D|JM|JC -> M is achieved by adding an asynchronous durability notification protocol to the disk interface. Besides the notification after receiving a request, the disk is also expected to send a notification after the write is made durable. This way the file system can start checkpointing a transaction only after receiving all the notifications for D, JM and JC. There is no need to flush for keeping this ordering.
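A minimal sketch of this gating logic under assumed structures (checkpoint_metadata() is a hypothetical helper; this is not the OptFS implementation): the in-place checkpoint of M is issued only once durability notifications have arrived for every block of the committed transaction, with no flush required.

#include <stdbool.h>

struct txn {
    int pending_adns;   /* blocks issued for this transaction, not yet durable */
    bool committed;     /* JC has been issued for this transaction             */
};

void checkpoint_metadata(struct txn *t);   /* assumed: writes M in place */

/* Called when the disk reports one of the transaction's blocks as durable. */
void on_durability_notification(struct txn *t)
{
    if (--t->pending_adns == 0 && t->committed)
        checkpoint_metadata(t);   /* D, JM and JC are durable; M may proceed */
}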
The ordering constraint Tx:i -> Tx:i+1 is achieved by in-order journal recovery and release. The idea is simple: transactions are allowed to be written out of order, but a transaction is only considered to be present if all previous ones are.
Data block reuse is delayed until the transaction that frees this block is durable. Data block overwriting is handled by either allocating new data blocks or doing selective data journaling. The latter puts data in the journal and does in-place update of both data and metadata during checkpointing, thus provides better locality than the former method but requires two writes of each data block.
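A sketch of the selective-data-journaling decision just described (journal_data_block() and write_in_place() are hypothetical helpers, not the OptFS code): an overwrite of an already-allocated block is routed into the journal and checkpointed to its original location later, while a newly allocated block is written in place as in ordered mode.

#include <stdbool.h>

struct txn;
struct block;

void journal_data_block(struct txn *t, struct block *b);  /* assumed helper */
void write_in_place(struct block *b);                     /* assumed helper */

void submit_data_block(struct txn *t, struct block *b, bool overwrites_allocated_block)
{
    if (overwrites_allocated_block)
        journal_data_block(t, b);  /* data goes into the journal; checkpointed in place later */
    else
        write_in_place(b);         /* ordinary ordered-mode data write                        */
}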
At the application layer, fsync is replaced with osync and dsync, separating the guarantee for ordering and durability. The semantics are similar to ordering versus flush at the block level.

4. Evaluation
The goal of optimistic crash consistency is to achieve the same consistency guarantee as pessimistic journaling while performing as fast as probabilistic consistency. The authors verified the correctness of OptFS by running synthetic workloads. For performance, both micro-benchmarks and macro-benchmarks are run. Results show that OptFS performs well for workloads with heavy use of fsync and no worse on all other workloads except sequential overwrites.
They also measured the CPU consumption, memory consumption and journal size requirement, showing that OptFS is feasible in practical use.

5. Confusion
Why is the virtual memory subsystem involved?

1. Summary
This paper describes optimistic crash consistency for journaling file-systems and is built using an optimistic commit protocol which delivers high performance and recovers correctly from crashes. The authors also introduce new file-system primitives osync() and dsync() to decouple ordering from durability. The authors use ADNs (Asynchronous Durability Notification), transactional checksumming, delayed writes and selective data journaling to improve the performance and correctness over probabilistic crash consistency models. Empirical data from micro and macro benchmarks (robustness tests) are used to prove that OptFS provides correctness and improved performance on diverse workloads.

2. Problem
Storage devices prior to the introduction of write-buffering had simple read and write primitives which worked correctly. The introduction of write-buffering created the need for write ordering to guarantee consistency, as writes could now complete out of order and crashes could occur at any instant. Many pessimistic systems use explicit cache flushing within modern drives to provide persistence, and this results in extremely poor performance. However, many modern system designers are now sacrificing correctness for performance by disabling flushing (the probabilistic approach), as crashes are a rare occurrence while flushing imposes a heavy burden.
Prior efforts to improve pessimistic journaling introduced two major optimisations (D|Jm -> Jc -> M, D-> Jm||Jc -> M) that could not be coupled (D|(Jm||Jc) -> M) effectively to provide correctness. The authors attempt to solve this problem by using transactional checksums and asynchronous durability notifications (ADN) combined with novel ideas like selective data journaling, delayed writes, etc...

3. Contributions
One of the major contributions of this paper is the detailed study of pessimistic and probabilistic crash consistency models. The authors studied the impact of flushing on performance, found roughly a 5X performance improvement on systems with flushing disabled compared to systems that used explicit flushing, and used this to motivate interest in a probabilistic model. They also quantified probabilistic consistency and introduced P(inc) to study inconsistency on no-flush systems. Their study clearly describes the impact of workload, queue size and journal layout on P(inc). Also, they provide a fine-grained breakdown of the inconsistencies (early commit, early checkpoint, transaction misorder and mixed) and attribute 90% of them to early commit. Another major contribution of this paper is the decoupling of ordering from durability. The authors introduced ADNs as a mechanism to inform clients when disk writes become durable (on disk). They also clearly define the properties of optimistic crash consistency and describe in detail several optimistic techniques - in-order journal recovery, in-order journal release, transactional checksums, background write after notification, reuse after notification and selective data journaling - both in theory and as utilised within OptFS.
Overall, the system doesn't guarantee strict ACID properties but ACI-(eventual Durability).

4. Evaluation
The authors evaluate their implementation (OptFS) along two axes: reliability and performance. To verify reliability, two different workloads (append, overwrite) were used along with osync() to stress OptFS, and in all 400 cases the system recovered to a consistent state. To evaluate performance, several micro and macro benchmarks were utilised. The micro-benchmarks helped study the behaviour of OptFS on sequential write, sequential overwrite, random write and file creation. OptFS performed better than ext4 (with and without flushes) in almost all these cases except for sequential overwrite, where OptFS uses its data-journaling mode (every data block is written twice) and the bandwidth is halved. The macro-benchmark workloads Filebench Fileserver, Webserver and Varmail were run, and here again OptFS performs as well (1X) as ext4 with no flushes except for Varmail, where OptFS has a 7X performance improvement. This improved performance is because Varmail frequently uses fsync() and OptFS delays writing dirty blocks and issues them in large batches periodically or on commit. They also studied the performance of OptFS on database workloads by running the MySQL OLTP benchmark from Sysbench. Here OptFS performs 10X better than ext4 ordered mode with flushes and 40% worse than ordered mode without flushes. This behaviour is observed because this workload does many in-place overwrites and OptFS performs selective data journaling. The authors also studied the impact of journal size on performance, and here again OptFS's performance lies between the bounds of ext4 with and without flushes: OptFS performs poorly (near ext4 with flushes) for small journal sizes but performs well for reasonably sized journals of 512 MB and greater. OptFS uses more CPU than ext4 ordered mode without flushes, which can be attributed to CRC32 checksumming and the background clean-up thread. The memory footprint has also gone up, as OptFS holds metadata in memory longer. The authors need to provide more details describing the resource consumption, as these results are meagre and insufficient. The authors also present two case studies (atomic updates within gedit and temporary logging in SQLite). OptFS has a 40X improvement over ext4 without flushes on the gedit study and a 10X improvement over ext4 with flushes on the SQLite study.

5. Confusion
What happens to the system when ADNs are lost ??
How does the implementation detect that a data-block is going to be re-used for selective data journaling ??
The notion of handling data blocks (Sec 5.2) in OptFS is confusing and additional explanation would help clear the air on this.
More details about FUA and tagged queueing would be interesting.


1. Summary
The paper describes a new approach for maintaining file system consistency after a crash in journaling file systems. The approach is called Optimistic Crash Consistency and provides high performance with strong consistency.

2. Problem
Modern journaling file systems provide consistency by using the flush commands exposed by disks. These commands flush all the data in the disk's cache irrespective of the requirements of the client and are inefficient (around 5x performance degradation compared to no flushing). File systems which use flush commands, conflating ordering and durability, are said to use pessimistic crash consistency. Certain journaling systems disable flushing to provide probabilistic crash consistency, which means that if the system crashes under certain conditions the file system data structures might end up in an inconsistent state. This probabilistic approach is not suitable where consistency guarantees are required.

3. Contributions
The main contribution of this work is separating ordering from durability. OptFS, which was developed as part of this research, trades the freshness of data for performance while maintaining strong consistency guarantees. One of the important primitives developed for separating ordering from persistence is the asynchronous durability notification, which is a notification received from the disk after it completes writing data to persistent storage. Along with this notification, other techniques were introduced by this research for optimistic consistency. In-order journal recovery is the mechanism applied during crash recovery: if the recovery code encounters an invalid transaction, it does not replay any transaction after that. In-order journal release is a simple mechanism used to preserve transaction data in the journal until the corresponding metadata is checkpointed. OptFS also maintains checksums of the data and metadata and stores them in the commit block; when a crash occurs and the stored and calculated checksums do not match, the transaction is discarded. Background write after notification helps increase application performance as checkpointing is done in the background. Reuse after notification and selective data journaling are used to handle cases where blocks are deleted in one transaction and then reused in another before the journal entry is durable, and where blocks are updated in place.

4. Evaluation
This paper does a very good job of evaluating both the contemporary systems and the newly developed system. The initial sections show how systems with pessimistic consistency lack performance compared to systems which have only probabilistic consistency guarantees. The authors also quantify probabilistic consistency in a very elegant formula and evaluate various factors, like workload, disk queue size and journal layout, which can affect the probabilistic guarantees. The experiments show that as the queue size increases and the distance between data and journal decreases, the probability of inconsistency increases even as the I/O rate increases, highlighting the tradeoff between I/O rate and consistency. The final evaluation of OptFS is done by approximating the asynchronous notifications using durability timeouts, which is understandable as it is not possible to change the disk firmware to add the support. On this filesystem the authors run and show the results of micro-benchmarks as well as real-world workloads; the evaluation shows that OptFS does provide strong consistency and improved performance over a pessimistic system, but for the pathological workload of sequential overwrites it performs at about 0.5X of both a pessimistic and a probabilistic system. I was wondering how the system would have performed if the disk interrupted for durability as well, because then the system would have to deal with 2X the interrupts.

5. Confusion
In section 4 the paper talked about "backpointers" in journals, what are those?

1. Summary
This paper introduces an optimistic way to deal with crash consistency, which continues to prevent inconsistency while performing better than the consistency models used in current systems such as ext4.
2. Problem
The problem is that disks take write requests but only notify that they have received the requests, not that the data is persisted. After a request is received, the disk writes the data persistently, but it might reorder the blocks for better performance. Because of this reordering, a journaled FS might have some of each transaction persisted but not all, making those transactions useless. In order to make sure things are persisted, programs can call fsync() to force the disk to persist the data, but the resulting flush persists everything in the disk cache, not just the specific block(s). As a result, this pessimistic approach hinders performance greatly.
3. Contributions
The goal of the FS is to commit transactions in such a way that consistency is achieved at the same level as pessimistic journaling, while providing the same performance as probabilistic journaling, which relies on the disk largely preserving ordering since the journal is a contiguous space.
The main idea behind OptFS is the decoupling of ordering and durability, and the FS uses two main methods to preserve consistency as well as performance: 1. Using checksums to eliminate disk-write ordering needs, and 2. Requiring two notifications from the disk for write requests: 1) when the write request is received and is in cache, 2) when the request is finally persisted on disk.
The FS preserves two properties: a transaction cannot be observed unless every previous transaction can be observed, and it is not possible for metadata to point to invalid data.
As a result, the authors include several techniques to be optimistic in order to be fast, yet be consistent. These techniques include in-order journal recovery, background write after notification, and selective data journaling.
Even though the FS continues to offer the fsync() system call, it also provides osync() to let applications preserve the order of write requests, as well as dsync() to ensure that the writes are persisted when the system call returns.
4. Evaluation
The goal of OptFS is to be fast while still providing consistency guarantees. For that reason, the authors evaluate both the reliability of the FS and its performance. Regarding reliability, the authors run 400 different crash cases and show that OptFS recovers to a consistent state in all of them, though possibly an older version of the data. However, they do not mention how much data was lost because recent, incomplete transactions were discarded. I believe this is an important evaluation because the approach is optimistic and one needs to know how much new data would be lost if a crash does happen.
Regarding performance, the authors do extensive evaluations with diverse workloads and show that they are able to perform much better than ext4 in ordered mode with flushes, though it turns out that OptFS is not the best for workloads consisting mainly of sequential overwrites. The authors also modify two applications (gedit and SQLite) to utilize the new system calls and show improvements over the previous FS; this is a great way to evaluate a FS because it uses popular applications that can benefit greatly.
5. Confusion
Have disk manufacturers adopted the new notification design to improve performance of FSes, to relieve them from guessing the estimated time to durability?

1. Summary
Optimistic crash consistency is built upon checksums, delayed writes and durability notifications to achieve better performance along with consistency. Checksums make it possible to validate the journaled metadata and commit together: if any of the covered blocks was not written to disk, the checksum does not match. Delayed writes keep the order between D|Jm|Jc and the checkpointed metadata to prevent inconsistency caused by reordering. The evaluation results show that it maintains consistency across crashes, and that its performance is better than ext4 with flushes and close to ext4 without flushes.

2. Problem
Write buffering in the disk causes reordering of writes, even though the operating system sends requests in order. This may result in inconsistency after a crash, depending on when the crash happens. Therefore, many file systems use flushes to provide consistency across crashes, but a flush is a very expensive operation in terms of performance because all buffered data must be made durable in order. On the other hand, the disk can operate fast when the file system does not use flushes, though a crash may then leave it inconsistent.

3. Contributions
The main contribution here is building a fast file system with consistency using checksums and notifications. As explained in the problem section, the flush operation has pros and cons; it is hard to achieve high performance together with consistency when relying on flushes.
To keep consistency in the ext4 system, a flush operation must be inserted wherever operations must execute in order: between Jm and Jc, between Jc and M, and after M. That makes three flush operations per journaled write, as sketched below.
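A minimal sketch of the pessimistic commit path just described, with the three flush points marked (the helpers are assumed placeholders, not actual ext4/jbd2 code); the paragraphs below describe how each of these flushes is removed.

struct txn;

void write_data_blocks(struct txn *t);       /* D  */
void write_journal_metadata(struct txn *t);  /* Jm */
void write_commit_block(struct txn *t);      /* Jc */
void checkpoint_metadata(struct txn *t);     /* M  */
void cache_flush(void);                      /* assumed: issues a disk cache flush */

void pessimistic_commit(struct txn *t)
{
    write_data_blocks(t);
    write_journal_metadata(t);
    cache_flush();              /* flush 1: D and Jm durable before Jc      */
    write_commit_block(t);
    cache_flush();              /* flush 2: Jc durable before checkpointing */
    checkpoint_metadata(t);
    cache_flush();              /* flush 3: M durable before journal reuse  */
}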
The first flush can be eliminated by using a checksum over Jm that is stored in Jc. If the computed checksum does not match the stored value, it means Jm was not fully written to disk due to reordering, and the transaction is discarded.
The second flush can be removed by using the asynchronous durability notification, which is not supported by current disks yet. Current disks only send a notification on receiving a write command from the operating system, while OptFS expects an additional notification when the write becomes durable. The metadata M is checkpointed only after notifications have arrived for all of the previous writes: D, Jm and Jc.
The third flush can be removed by in-order recovery and release. If writes are reordered among, say, five transactions and the 2nd transaction did not complete, then the 3rd through 5th transactions are discarded during recovery.

4. Evaluation
The authors first evaluate the current file system in terms of how fast it is and how vulnerable it is to crashes. The results show that disabling flushes provides the best performance, while write-heavy workloads are most likely to leave the disk inconsistent. In addition, the probability of inconsistency increases as the size of the write queue increases.
OptFS provides better performance for sequential writes, random writes and createfiles, but not for sequential overwrites. Because OptFS then uses data journaling, which requires writing the data twice, overwrites are slower than on the conventional file system. The results from the macro-benchmark programs show the same pattern.
There are also results about the overheads of OptFS. OptFS performs checksumming and delays checkpointing, which increase CPU usage and memory consumption respectively.
Lastly, the authors evaluate OptFS with gedit and SQLite under crash conditions. The results show that there are no inconsistent states for ext4 with flushes or for OptFS, and that OptFS is up to 40 times faster per operation than ext4 with flushes.

5. Confusion
Current disks cannot generate the second notification, which signals after D, Jm and Jc are made durable. Right?
