We thank the reviewers for their detailed feedback, and take encouragement from their words: "The idea of slightly tweaking power management code in drivers to save and restore device is so clever that we should accept the paper for that alone." (Review 1), "The paper presents a fresh new-look to driver and device recovery" (Review 4), and "The authors of this paper have done an impressive amount of engineering" (Review 3).

Our primary contribution is re-using existing power management code to provide device checkpoints; we have designed a fault tolerance system based on this mechanism (as identified by Reviews 1, 2, 4, and 5). We first address the common concerns and then turn to review-specific questions.

Review 1 asks why the cost of a checkpoint is so low compared to a device restart. We mention this briefly in Section 4.2: the cost is low because FGFT skips cold-booting the device and the complex device initialization code (device probe), which exchanges information with the device to detect its type and features and sets up driver and kernel data structures. The resume code we re-purpose is lightweight and does not bear the burden of re-determining the device model and features, as sketched below. We will describe this in detail in subsequent revisions of the paper.
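To make the reuse concrete, here is a minimal sketch of how an e1000-style network driver could export checkpoint/restore by calling into its existing power management entry points. This is an illustration under our assumptions, not code from the paper: the fgft_* names are hypothetical, and a real implementation would factor the final power-down step out of the suspend path.

```c
#include <linux/pci.h>

/* Existing e1000 power management handlers (defined in the driver). */
extern int e1000_suspend(struct pci_dev *pdev, pm_message_t state);
extern int e1000_resume(struct pci_dev *pdev);

static int fgft_device_checkpoint(struct pci_dev *pdev)
{
	/* Re-use the suspend path: quiesce the device and save its
	 * volatile state into driver memory.  Unlike a real suspend,
	 * a checkpoint would skip the final transition to a
	 * low-power state. */
	return e1000_suspend(pdev, PMSG_FREEZE);
}

static int fgft_device_restore(struct pci_dev *pdev)
{
	/* Re-use the resume path: reprogram the device from the saved
	 * state.  Unlike probe, this neither re-detects the device
	 * model and features nor rebuilds kernel data structures,
	 * which is why restore is far cheaper than a cold restart. */
	return e1000_resume(pdev);
}
```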
Reviews 1, 2, and 3 ask what types of faults we handle and whether they are more limited than those of related work (Nooks). In addition to memory errors, FGFT traps all processor exceptions: NULL pointer dereferences, general protection faults, alignment faults, divide errors (divide by zero), missing segments, and stack faults. One can also check additional invariants during marshaling (we do not). The range of faults handled is the same as in Nooks (Review 2), but our memory protection is much finer grained because size information for data structures is available during marshaling. Our fault injection tests (Section 5.1) used different bug types (Table 2), which manifest as memory violations or as one of the above processor exceptions.

Review 3: "This reviewer believes the paper should be rejected because it is long on engineering and short on science." We politely disagree. Our novel contribution is device checkpointing, which has a variety of uses beyond fault tolerance (Table 1). To clearly demonstrate its value and overheads, we implemented a driver isolation and a driver/device recovery solution, which has made the paper heavy on implementation and engineering. In subsequent revisions of the paper, we will better describe our research contributions: device checkpoints and in-kernel SFI using marshaling. However, we also believe that rigorous engineering is one of our important contributions.

Reviews 1, 3, 4, and 5 discuss selective isolation and where it is useful. FGFT can reuse the wealth of existing static analysis and dynamic instrumentation tools to identify buggy or vulnerable code, such as rarely used ioctls or recovery code, and run that code in isolation without affecting the core I/O path. In our evaluation (Section 5.4), we show that only 18% of all entry points are buggy. Furthermore, past work [1] on moving driver code to user mode has shown that bug density is skewed toward non-I/O code, since I/O paths are generally well tested; moreover, I/O code comprises a surprisingly small fraction of total driver code [1]. In subsequent revisions of the paper, we will demonstrate this with an example (Review 1).

Reviews 2, 3, and 4 discuss our synchronization policy. FGFT uses lazy version management: it acquires any device or driver locks required by the entry point and holds them until the transaction commits (or fails). Since FGFT safely shares locks with the rest of the threads, it must ensure that the changes made by an isolated thread are consistent and do not conflict when merged. Hence, it withholds releasing any driver or device locks until the transaction commits, as sketched below.
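As a rough illustration of this policy (our assumed names and data structures, not the paper's code), the wrapper below records every driver lock the isolated entry point acquires and releases the locks only when the transaction ends:

```c
#include <linux/spinlock.h>
#include <linux/types.h>

#define FGFT_MAX_LOCKS 16	/* assumed bound, for the sketch only */

/* Hypothetical per-invocation transaction state. */
struct fgft_txn {
	spinlock_t *held[FGFT_MAX_LOCKS];
	int nheld;
};

static void fgft_merge_copies(struct fgft_txn *txn);	/* assumed helper */

/* The isolated entry point takes driver locks through this wrapper,
 * so FGFT knows which locks to withhold until the transaction ends. */
static void fgft_spin_lock(struct fgft_txn *txn, spinlock_t *lock)
{
	spin_lock(lock);
	txn->held[txn->nheld++] = lock;	/* record for deferred release */
}

/* On commit, merge the entry point's private copies back into the
 * shared structures; on abort, simply discard them.  Only then are
 * the locks released, in reverse acquisition order, so no other
 * thread can observe an inconsistent partial update. */
static void fgft_txn_end(struct fgft_txn *txn, bool commit)
{
	int i;

	if (commit)
		fgft_merge_copies(txn);
	for (i = txn->nheld - 1; i >= 0; i--)
		spin_unlock(txn->held[i]);
}
```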
Review 2 compares us with the past systems TxOS (SOSP '09), TxLinux (SOSP '07), and xCalls (EuroSys '09). These systems either abort on I/O or limit themselves to memory checkpoints; they cannot checkpoint devices. The other isolation/recovery systems mentioned (Nooks, Mondrix, and Recovery Domains (ASPLOS '09)) do not checkpoint devices either. We believe device checkpointing is a useful contribution and would be excited to see how it is applied to other applications.

---

We now discuss important concerns raised by the reviewers that were not covered above:

Review 1 (Weak Accept): All concerns were discussed above.

Review 2 (Weak Accept): "Also, there appear to be limitations in supporting disks that the authors gloss over (with a reference to Membrane [39]). Finally, it is a bit suspicious that the authors used a 3 year old kernel (2.6.29 was released 3/09) for their evaluation. Does this hint at the difficulty of keeping CIL or their SFI infrastructure up to date with the kernel source tree?"

Concern over using CIL and SFI on newer kernels: Our infrastructure is not limited to a particular kernel version, and CIL has been used on recent kernels.

"I am most concerned with how FGFT must take exclusive access to the device to take a checkpoint. I would like to see a benchmark where, say, 10 or 100 independent threads are running netperf. The actual locking that must take place, especially during device callbacks, is ad hoc and often difficult to determine. That makes me nervous about how difficult it is to use FGFT in practice. Where is the USB mass storage device? Is its absence due to the problem of persistent storage (4.1.4)? Or the performance overhead of copyin/out (Section 3)? Is the overhead higher for storage devices?"

The performance overhead is highest for devices whose drivers are invoked most frequently. We tested network drivers because they are invoked very frequently (~70K times/s) while transferring only a small amount of data (one packet) per call; hence, network device performance should be the most sensitive to our overheads. Other systems that create copies of data, such as Nooks, observe the same behavior.

"In general, FGFT has this problem of locks acquired during callbacks, or memory allocation, or another action that would require a semantic rollback (if there are any). If the paper dealt with these limitations straightforwardly, I'd be inclined to dismiss them as technical details. But the paper is an odd blend of clear technical writing and explanation of tradeoffs (4.1.4 Discussion, 3.2 failure detection) and sales-y obfuscation (the persistent storage issue, converting sleeps to busy waiting (4.1.3)). It makes it difficult to trust the paper's assumptions. Case in point: 'The above mechanism protects shared structures across different driver threads. However, the suspicious thread can also block waiting for data to arrive on shared structures that have been copied over from other driver threads. In such cases, FGFT requires extra annotation to re-synchronize shared data.' This paragraph appears to hide a world of caveat. The problem of arbitrary resynchronization seems very difficult. Can you please analyze the drivers (even classes you didn't evaluate) to convince me that this isn't a hopeless task for entire classes of drivers? In (3.1.2) you say that any lock grabbed during a kernel callback is held until exit from the driver. Yikes! You have just modified the kernel locking convention, and how do you guarantee you won't deadlock? You just have to understand the driver and callback behavior well enough to know you won't, and that concerns me. I also don't understand how copyin/out can work if there are multiple threads ever let into the driver, even after configuration. Can't these threads take different amounts of time and then clobber each other's updates? What about compensation actions occurring while another part of the driver is executing and adding compensation actions? How does FGFT manage locking?"

FGFT synchronizes using existing locks, which are provided "read only" to the isolated module. All locking is done through kernel API calls and is recorded in a kernel log so that it can be rolled back on failure. The quoted paragraph describes a special, hypothetical case in which an isolated thread blocks waiting for data to arrive in a structure shared across threads. In such cases, the structure must be marked with a special annotation so that it is shared with read access. We can analyze existing drivers to identify how common this synchronization pattern is.

Review 3 (Reject): "Assumptions in the paper/system that are not addressed or validated in the paper (in priority order): 1) Memory safety violations are the primary cause of driver failures. This neglects other causes of driver failure including race conditions, lock inversions, state machine errors, errors in logic, etc. Key unanswered question: what fraction of driver failures are caused by memory safety violations?"

Types of faults handled: As mentioned above, we handle all faults that raise processor exceptions and additionally detect fine-grained memory violations. Every bug that manifests as a fault during the execution of a driver entry point is detected and cleaned up, including bugs (such as stack corruption) that translate into a memory violation or one of the exceptions listed above. FGFT cannot detect hangs (nor can most related work), but a watchdog can be used to detect a hang and abort an FGFT transaction, as sketched below. Table 2 shows that in addition to memory violations, we also tested other bug types, such as missing parameters and corrupted expressions.
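One possible realization of such a watchdog, under our assumptions (a struct timer_list watchdog field added to the hypothetical fgft_txn above, and an assumed fgft_request_abort() helper), is a per-invocation kernel timer:

```c
#include <linux/timer.h>
#include <linux/jiffies.h>

/* Fires only if the entry point overruns its time budget. */
static void fgft_watchdog_fire(unsigned long data)
{
	struct fgft_txn *txn = (struct fgft_txn *)data;

	/* A running kernel thread cannot be stopped forcibly, so mark
	 * the transaction for abort: subsequent kernel calls from the
	 * isolated thread fail, the transaction unwinds, and the
	 * device is restored from its checkpoint. */
	fgft_request_abort(txn);	/* assumed helper */
}

static void fgft_entry_start(struct fgft_txn *txn)
{
	setup_timer(&txn->watchdog, fgft_watchdog_fire,
		    (unsigned long)txn);
	mod_timer(&txn->watchdog, jiffies + 2 * HZ);	/* 2 s budget */
}

static void fgft_entry_done(struct fgft_txn *txn)
{
	del_timer_sync(&txn->watchdog);	/* finished in time */
}
```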
"2) SFI isolation can automatically separate locking and ordering operations for memory accesses. Key unanswered question: how does the isolator identify and refactor locking operations?"

We use static analyses to detect all memory and locking operations in the suspect code.

"3) Power management code can practically be refactored into checkpointing and restore code. Figure 2 shows logically how the power management code must be refactored. The paper seems to imply that this refactoring can be done automatically. Key questions: is refactoring power management code into checkpoint and restore code automatically possible? If it can't be done automatically, how much domain and device expertise is required to do it manually?"

Concern over developer effort in re-factoring power management code: We argue that any driver developer can easily export checkpoint/restore with very little re-engineering of an existing driver, provided it supports power management. This feature is sometimes thought to require special hardware and driver support, but we show that it requires little manual effort (Section 5.FIXME), since the suspend/resume code already contains the necessary logic. We agree that an untrained developer, new to a driver, may experience difficulty if the suspend/resume code is not straightforward.

"4) Analysis for a small number of drivers (6) for a small number of device classes (3) can be generalized to all drivers. Key unanswered questions: if the system is so easy to apply, why wasn't it applied to 60 or 600 drivers instead of just 6? What about more complex drivers like queuing storage drivers or graphics drivers?"

Concern over other/more complex devices: We cannot test 600 drivers because we need a physical device to test each driver. We cover the PCI and USB device classes, which account for a significant portion of the devices supported by the kernel. Furthermore, through static analysis we identify the drivers that support power management (Table 7). In subsequent revisions, we plan to discuss applicability to complex devices.

"5) Checkpointing (based on power management) imposes no serialization or ordering issues. Power management is generally well serialized against execution (by the OS). Key unanswered question: does refactoring of power management code for checkpointing violate any ordering assumptions in the code?"

We do not impose any ordering assumptions. However, the checkpoint/restore code may be invoked in interrupt/atomic contexts, and hence may need to be modified to make this safe. We discuss this in Section 4.1.3.

"6) Errors in driver code are not in driver power management code. Key unanswered question: if power management code has errors, won't these errors be propagated (either manually or automatically) to checkpointing code?"

Bugs in power management code: This is a limitation of applying device checkpoints to fault tolerance. However, much of the device checkpoint/restore code is shared with initialization code, which is well tested. Also, other applications of device checkpoints do not suffer from this limitation.

Review 4 (Accept): 1. "One major qualm that I have about the paper is that despite knowing that the driver is faulty (which is the reason why it is being recovered), FGFT relies on other parts of the driver itself for recovery. This makes the assumption in section 3 about 'driver code used for recovery cannot be isolated and must be trusted' a little uncomfortable for me. In contrast, schemes like Nooks, etc. have explicit external logging and replay mechanisms to orchestrate this recovery. Can the authors comment about this dependence that they introduce?"

We only require suspend/resume to be fault free, which subsumes a small portion of device initialization in the common case; for a driver to be up and running at all, this code must be bug free. Moreover, FGFT simply unloads the driver if the device restore fails.

2. "I am a little confused about the fault-model described in section 2.1. From my understanding of paragraph 4 in section 2.1, I gather that you are assuming every driver invocation is state-less, i.e., one invocation does not affect the next. However, consider the following rather common scenario where you have a server and multiple worker threads have been spawned to handle multiple client requests. Let us assume that thread 1 and thread 2 are ready to send files through the network. As per TCP/IP semantics, the driver invocations will have to be on a per-packet basis and therefore, the two threads invoking the driver will be interspersed with one another before the entire file is transferred. However, won't the driver have to keep state of each thread's invocations to make sure that it correctly generates the packets? Doesn't this make the driver stateful? Can you please clarify what I am missing here?"

Concern over state shared across threads during a file transfer: Our goal is to run each thread like a transaction. The packet-transmit entry point simply sends a packet (and updates the appropriate statistics), while the per-file accounting is done by the file transfer application. With FGFT, a failing driver thread returns an error to the application, which can then fail the transfer or re-spawn that particular thread. Without FGFT, a thread can crash the system, or worse, corrupt state and hang, making the entry point unavailable until the system is rebooted.

3. "The FGFT scheme requires that driver-state and kernel-state touched by the driver is explicitly copied to enable recovery. While the driver-state touched is explicitly annotated by the user, it is unclear how the kernel-state touched is identified. While there is some description about ensuring that only fields of structures, and not the entire structure, touched by the driver are copied, it is unclear how this works in practice. Can the authors provide an example of the kinds of structures that were touched and which fields were copied, as opposed to the entire structure?"

An example of copying fields (and not whole structures): Consider an ioctl that updates the driver's internal private structure (usually pointed to by netdev->priv, where netdev is the kernel's net device structure). In such cases, FGFT uses points-to analysis to statically pre-determine the fields touched, such as netdev->priv->tx_ring and netdev->priv->rx_ring, and generates marshaling code that copies in/out only those fields. This reduces both the marshaling code and unnecessary copying.
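The generated code might look roughly like the sketch below. The type names follow the e1000 driver; struct fgft_shadow and the copy helpers are our assumed illustration, not the paper's actual generated output:

```c
#include <linux/netdevice.h>
#include "e1000.h"	/* e1000_adapter, e1000_tx_ring, e1000_rx_ring */

/* Private copies of only the fields the entry point touches. */
struct fgft_shadow {
	struct e1000_tx_ring tx_ring;
	struct e1000_rx_ring rx_ring;
};

/* Copy in just the two rings identified by points-to analysis,
 * rather than the entire (much larger) private adapter structure. */
static void fgft_copy_in(struct fgft_shadow *shadow,
			 struct net_device *netdev)
{
	struct e1000_adapter *adapter = netdev_priv(netdev);

	shadow->tx_ring = *adapter->tx_ring;
	shadow->rx_ring = *adapter->rx_ring;
}

/* On commit, merge the possibly modified copies back. */
static void fgft_copy_out(struct fgft_shadow *shadow,
			  struct net_device *netdev)
{
	struct e1000_adapter *adapter = netdev_priv(netdev);

	*adapter->tx_ring = shadow->tx_ring;
	*adapter->rx_ring = shadow->rx_ring;
}
```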
5. "How were the time-related measurements in section 5.3 done? What is the error margin of the measurement?"

Timing measurement detail: We used the TSC processor register (via rdtscll calls) to obtain timestamps, which gives extremely high precision over short intervals. We report the average of 5 runs.
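For instance, a measurement on a 2.6-era x86 kernel might look like the fragment below (fgft_device_checkpoint and pdev are the hypothetical names from the first sketch):

```c
#include <linux/kernel.h>
#include <asm/msr.h>	/* rdtscll() on 2.6-era x86 kernels */

unsigned long long start, end, total = 0;
int i;

for (i = 0; i < 5; i++) {
	rdtscll(start);
	fgft_device_checkpoint(pdev);	/* operation being timed */
	rdtscll(end);
	total += end - start;
}
/* Mean cycles over 5 runs; dividing by the TSC frequency
 * (cycles per second) converts to wall-clock time. */
printk(KERN_INFO "fgft: checkpoint avg %llu cycles\n", total / 5);
```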
Review 5 (Weak Reject):

- "Cost of protection: 20+ us. In other words, the approach adds 60,000 cycles to each driver entry point that needs to be protected." We agree. However, this cost is much lower than that of logging each and every device operation across all driver calls (not just the failing one) in order to restore the device configuration correctly.

- "Knowing what to protect (the method is probably too expensive to protect everything)." See the discussion of selective isolation above: static analysis and dynamic instrumentation tools can identify the suspect entry points worth protecting.

- "Which class of bugs does it actually help against (e.g., would a restarted driver after checkpoint resumption just fail again the same way, in the common case?). The method seems to only work for heisenbugs (flaky hardware and such), but then I wonder if I would really want to run my compute services on broken hardware." Can FGFT protect against more than flaky-hardware bugs? Yes: we can protect against flaky hardware, against entry points that fail on specific inputs (e.g., a buffer overflow), and against infrequently used code (a rarely invoked ioctl need not make the common case inoperable, or even slower). Instead of crashing the system, or making the device inoperable because a thread crashed while holding device/driver locks, FGFT ensures that the other code paths of the device remain operable (Section 5.1).

- "Is there an assumption about fully serialized drivers (checkpointing requires that no thread is executing anywhere in the driver at checkpoint time)? Put differently, do the performance data presented account for the possibility that the method forces more serialization than would otherwise be needed for the driver (e.g., multiqueue NICs, etc.)? What assumptions are made about drivers for stateful devices like disks, where a checkpoint can't include the data being written to disk?" For drivers with persistent internal state, such as disks and other storage devices, restore recovers only the transient device state, not the persistent state such as the contents of files. As a result, the use of checkpoints must be coordinated with higher-level recovery mechanisms, such as Membrane [39], to keep persistent data consistent.

[1] Microdrivers: A New Architecture for Device Drivers, HotOS 2007.

===

- We are not yet clear about multi-queue NICs; our guess is that they should not cause any more problems than standard network drivers.
- We are also not yet clear about the special case of acquiring a lock on a shared structure and then waiting for data to arrive on it: since all locks are released only at the end, this can deadlock.