We thank the reviewers for their detailed feedback, and take encouragement from their words: "The idea of slightly tweaking power management code in drivers to save and restore device is so clever that we should accept the paper for that alone." (Review 1), "The paper presents a fresh new-look to driver and device recovery" (Review 4), and "The authors of this paper have done an impressive amount of engineering" (Review 3).

Our primary contribution is re-using existing power management code to provide device checkpoints; we have designed a fault tolerance system based on this mechanism (as identified by Reviews 1, 2, 4, and 5). We first address the common concerns and then turn to review-specific questions.

Review 1 asks why the cost of a checkpoint is so low compared to a device restart. We mention this briefly in Section 4.2: the cost is low because FGFT skips cold-booting the device and the complex device initialization code (device probe), which exchanges information with the device to detect its type and features and sets up driver and kernel data structures. The resume code we re-purpose is lightweight and does not bear the burden of re-determining the device model and features, as sketched below. We will describe this in detail in subsequent revisions of the paper.
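To make the reuse concrete, here is a minimal sketch of how an e1000-style network driver could export checkpoint/restore by calling into its existing power management entry points. This is an illustration under our assumptions, not code from the paper: the fgft_* names are hypothetical, and a real implementation would factor the final power-down step out of the suspend path.

```c
#include <linux/pci.h>

/* Existing e1000 power management handlers (defined in the driver). */
extern int e1000_suspend(struct pci_dev *pdev, pm_message_t state);
extern int e1000_resume(struct pci_dev *pdev);

static int fgft_device_checkpoint(struct pci_dev *pdev)
{
	/* Re-use the suspend path: quiesce the device and save its
	 * volatile state into driver memory.  Unlike a real suspend,
	 * a checkpoint would skip the final transition to a
	 * low-power state. */
	return e1000_suspend(pdev, PMSG_FREEZE);
}

static int fgft_device_restore(struct pci_dev *pdev)
{
	/* Re-use the resume path: reprogram the device from the saved
	 * state.  Unlike probe, this neither re-detects the device
	 * model and features nor rebuilds kernel data structures,
	 * which is why restore is far cheaper than a cold restart. */
	return e1000_resume(pdev);
}
```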
Reviews 1, 2, and 3 ask what types of faults we handle and whether they are more limited than those of related work (Nooks). In addition to memory errors, FGFT traps all processor exceptions: NULL pointer dereferences, general protection faults, alignment faults, divide errors (divide by zero), missing segments, and stack faults. One can also check additional invariants during marshaling (we do not). The range of faults handled is the same as in Nooks (Review 2), but our memory protection is much finer grained because size information for data structures is available during marshaling. Our fault injection tests (Section 5.1) used different bug types (Table 2), which manifest as memory violations or as one of the above processor exceptions.

Review 3: "This reviewer believes the paper should be rejected because it is long on engineering and short on science." We politely disagree. Our novel contribution is device checkpointing, which has a variety of uses beyond fault tolerance (Table 1). To clearly demonstrate its value and overheads, we implemented a driver isolation and a driver/device recovery solution, which has made the paper heavy on implementation and engineering. In subsequent revisions of the paper, we will better describe our research contributions: device checkpoints and in-kernel SFI using marshaling. However, we also believe that rigorous engineering is one of our important contributions.

Reviews 1, 3, 4, and 5 discuss selective isolation and where it is useful. FGFT can reuse the wealth of existing static analysis and dynamic instrumentation tools to identify buggy or vulnerable code, such as rarely used ioctls or recovery code, and run that code in isolation without affecting the core I/O path. In our evaluation (Section 5.4), we show that only 18% of all entry points are buggy. Furthermore, past work [1] on moving driver code to user mode has shown that bug density is skewed toward non-I/O code, since I/O paths are generally well tested; moreover, I/O code comprises a surprisingly small fraction of total driver code [1]. In subsequent revisions of the paper, we will demonstrate this with an example (Review 1).

Reviews 2, 3, and 4 discuss our synchronization policy. FGFT uses lazy version management: it acquires any device or driver locks required by the entry point and holds them until the transaction commits (or fails). Since FGFT safely shares locks with the rest of the threads, it must ensure that the changes made by an isolated thread are consistent and do not conflict when merged. Hence, it withholds releasing any driver or device locks until the transaction commits, as sketched below.
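As a rough illustration of this policy (our assumed names and data structures, not the paper's code), the wrapper below records every driver lock the isolated entry point acquires and releases the locks only when the transaction ends:

```c
#include <linux/spinlock.h>
#include <linux/types.h>

#define FGFT_MAX_LOCKS 16	/* assumed bound, for the sketch only */

/* Hypothetical per-invocation transaction state. */
struct fgft_txn {
	spinlock_t *held[FGFT_MAX_LOCKS];
	int nheld;
};

static void fgft_merge_copies(struct fgft_txn *txn);	/* assumed helper */

/* The isolated entry point takes driver locks through this wrapper,
 * so FGFT knows which locks to withhold until the transaction ends. */
static void fgft_spin_lock(struct fgft_txn *txn, spinlock_t *lock)
{
	spin_lock(lock);
	txn->held[txn->nheld++] = lock;	/* record for deferred release */
}

/* On commit, merge the entry point's private copies back into the
 * shared structures; on abort, simply discard them.  Only then are
 * the locks released, in reverse acquisition order, so no other
 * thread can observe an inconsistent partial update. */
static void fgft_txn_end(struct fgft_txn *txn, bool commit)
{
	int i;

	if (commit)
		fgft_merge_copies(txn);
	for (i = txn->nheld - 1; i >= 0; i--)
		spin_unlock(txn->held[i]);
}
```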
Review 2 compares us with the past systems TxOS (SOSP '09), TxLinux (SOSP '07), and xCalls (EuroSys '09). These systems either abort on I/O or limit themselves to memory checkpoints; they cannot checkpoint devices. The other isolation/recovery systems mentioned (Nooks, Mondrix, and Recovery Domains (ASPLOS '09)) do not checkpoint devices either. We believe device checkpointing is a useful contribution and would be excited to see how it is applied to other applications.

---

We now discuss important concerns raised by the reviewers that were not covered above:

Review 1 (Weak Accept): All concerns were discussed above.

Review 2 (Weak Accept): "Also, there appear to be limitations in supporting disks that the authors gloss over (with a reference to Membrane [39]). Finally, it is a bit suspicious that the authors used a 3 year old kernel (2.6.29 was released 3/09) for their evaluation. Does this hint at the difficulty of keeping CIL or their SFI infrastructure up to date with the kernel source tree?"

Concern over using CIL and SFI on newer kernels: Our infrastructure is not limited to a particular kernel version, and CIL has been used on recent kernels.

"I am most concerned with how FGFT must take exclusive access to the device to take a checkpoint. I would like to see a benchmark where, say, 10 or 100 independent threads are running netperf. The actual locking that must take place, especially during device callbacks, is ad hoc and often difficult to determine. That makes me nervous about how difficult it is to use FGFT in practice. Where is the USB mass storage device? Is its absence due to the problem of persistent storage (4.1.4)? Or the performance overhead of copyin/out (Section 3)? Is the overhead higher for storage devices?"

The performance overhead is highest for devices whose drivers are invoked most frequently. We tested network drivers because they are invoked very frequently (~70K times/s) while transferring only a small amount of data (one packet) per call; hence, network device performance should be the most sensitive to our overheads. Other systems that create copies of data, such as Nooks, observe the same behavior.

"In general, FGFT has this problem of locks acquired during callbacks, or memory allocation, or another action that would require a semantic rollback (if there are any). If the paper dealt with these limitations straightforwardly, I'd be inclined to dismiss them as technical details. But the paper is an odd blend of clear technical writing and explanation of tradeoffs (4.1.4 Discussion, 3.2 failure detection) and sales-y obfuscation (the persistent storage issue, converting sleeps to busy waiting (4.1.3)). It makes it difficult to trust the paper's assumptions. Case in point: 'The above mechanism protects shared structures across different driver threads. However, the suspicious thread can also block waiting for data to arrive on shared structures that have been copied over from other driver threads. In such cases, FGFT requires extra annotation to re-synchronize shared data.' This paragraph appears to hide a world of caveat. The problem of arbitrary resynchronization seems very difficult. Can you please analyze the drivers (even classes you didn't evaluate) to convince me that this isn't a hopeless task for entire classes of drivers? In (3.1.2) you say that any lock grabbed during a kernel callback is held until exit from the driver. Yikes! You have just modified the kernel locking convention, and how do you guarantee you won't deadlock? You just have to understand the driver and callback behavior well enough to know you won't, and that concerns me. I also don't understand how copyin/out can work if there are multiple threads ever let into the driver, even after configuration. Can't these threads take different amounts of time and then clobber each other's updates? What about compensation actions occurring while another part of the driver is executing and adding compensation actions? How does FGFT manage locking?"

FGFT synchronizes using existing locks, which are provided "read only" to the isolated module. All locking is done through kernel API calls and is recorded in a kernel log so that it can be rolled back on failure. The quoted paragraph describes a special, hypothetical case in which an isolated thread blocks waiting for data to arrive in a structure shared across threads. In such cases, the structure must be marked with a special annotation so that it is shared with read access. We can analyze existing drivers to identify how common this synchronization pattern is.

Review 3 (Reject): "Assumptions in the paper/system that are not addressed or validated in the paper (in priority order): 1) Memory safety violations are the primary cause of driver failures. This neglects other causes of driver failure including race conditions, lock inversions, state machine errors, errors in logic, etc. Key unanswered question: what fraction of driver failures are caused by memory safety violations?"

Types of faults handled: As mentioned above, we handle all faults that raise processor exceptions and additionally detect fine-grained memory violations. Every bug that manifests as a fault during the execution of a driver entry point is detected and cleaned up, including bugs (such as stack corruption) that translate into a memory violation or one of the exceptions listed above. FGFT cannot detect hangs (nor can most related work), but a watchdog can be used to detect a hang and abort an FGFT transaction, as sketched below. Table 2 shows that in addition to memory violations, we also tested other bug types, such as missing parameters and corrupted expressions.
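One possible realization of such a watchdog, under our assumptions (a struct timer_list watchdog field added to the hypothetical fgft_txn above, and an assumed fgft_request_abort() helper), is a per-invocation kernel timer:

```c
#include <linux/timer.h>
#include <linux/jiffies.h>

/* Fires only if the entry point overruns its time budget. */
static void fgft_watchdog_fire(unsigned long data)
{
	struct fgft_txn *txn = (struct fgft_txn *)data;

	/* A running kernel thread cannot be stopped forcibly, so mark
	 * the transaction for abort: subsequent kernel calls from the
	 * isolated thread fail, the transaction unwinds, and the
	 * device is restored from its checkpoint. */
	fgft_request_abort(txn);	/* assumed helper */
}

static void fgft_entry_start(struct fgft_txn *txn)
{
	setup_timer(&txn->watchdog, fgft_watchdog_fire,
		    (unsigned long)txn);
	mod_timer(&txn->watchdog, jiffies + 2 * HZ);	/* 2 s budget */
}

static void fgft_entry_done(struct fgft_txn *txn)
{
	del_timer_sync(&txn->watchdog);	/* finished in time */
}
```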
"2) SFI isolation can automatically separate locking and ordering operations for memory accesses. Key unanswered question: how does the isolator identify and refactor locking operations?"

We use static analyses to detect all memory and locking operations in the suspect code.

"3) Power management code can practically be refactored into checkpointing and restore code. Figure 2 shows logically how the power management code must be refactored. The paper seems to imply that this refactoring can be done automatically. Key questions: is refactoring power management code into checkpoint and restore code automatically possible? If it can't be done automatically, how much domain and device expertise is required to do it manually?"

Concern over developer effort in re-factoring power management code: We argue that any driver developer can easily export checkpoint/restore with very little re-engineering of an existing driver, provided it supports power management. This feature is sometimes thought to require special hardware and driver support, but we show that it requires little manual effort (Section 5.FIXME), since the suspend/resume code already contains the necessary logic. We agree that an untrained developer, new to a driver, may experience difficulty if the suspend/resume code is not straightforward.

"4) Analysis for a small number of drivers (6) for a small number of device classes (3) can be generalized to all drivers. Key unanswered questions: if the system is so easy to apply, why wasn't it applied to 60 or 600 drivers instead of just 6? What about more complex drivers like queuing storage drivers or graphics drivers?"

Concern over other/more complex devices: We cannot test 600 drivers because we need a physical device to test each driver. We cover the PCI and USB device classes, which account for a significant portion of the devices supported by the kernel. Furthermore, through static analysis we identify the drivers that support power management (Table 7). In subsequent revisions, we plan to discuss applicability to complex devices.

"5) Checkpointing (based on power management) imposes no serialization or ordering issues. Power management is generally well serialized against execution (by the OS). Key unanswered question: does refactoring of power management code for checkpointing violate any ordering assumptions in the code?"

We do not impose any ordering assumptions. However, the checkpoint/restore code may be invoked in interrupt/atomic contexts, and hence may need to be modified to make this safe. We discuss this in Section 4.1.3.

"6) Errors in driver code are not in driver power management code. Key unanswered question: if power management code has errors, won't these errors be propagated (either manually or automatically) to checkpointing code?"

Bugs in power management code: This is a limitation of applying device checkpoints to fault tolerance. However, much of the device checkpoint/restore code is shared with initialization code, which is well tested. Also, other applications of device checkpoints do not suffer from this limitation.

Review 4 (Accept): 1. "One major qualm that I have about the paper is that despite knowing that the driver is faulty (which is the reason why it is being recovered), FGFT relies on other parts of the driver itself for recovery. This makes the assumption in section 3 about 'driver code used for recovery cannot be isolated and must be trusted' a little uncomfortable for me. In contrast, schemes like Nooks, etc. have explicit external logging and replay mechanisms to orchestrate this recovery. Can the authors comment about this dependence that they introduce?"

We only require suspend/resume to be fault free, which subsumes a small portion of device initialization in the common case; for a driver to be up and running at all, this code must be bug free. Moreover, FGFT simply unloads the driver if the device restore fails.

2. "I am a little confused about the fault-model described in section 2.1. From my understanding of paragraph 4 in section 2.1, I gather that you are assuming every driver invocation is state-less, i.e., one invocation does not affect the next. However, consider the following rather common scenario where you have a server and multiple worker threads have been spawned to handle multiple client requests. Let us assume that thread 1 and thread 2 are ready to send files through the network. As per TCP/IP semantics, the driver invocations will have to be on a per-packet basis and therefore, the two threads invoking the driver will be interspersed with one another before the entire file is transferred. However, won't the driver have to keep state of each thread's invocations to make sure that it correctly generates the packets? Doesn't this make the driver stateful? Can you please clarify what I am missing here?"

Concern over state shared across threads during a file transfer: Our goal is to run each thread like a transaction. The packet-transmit entry point simply sends a packet (and updates the appropriate statistics), while the per-file accounting is done by the file transfer application. With FGFT, a failing driver thread returns an error to the application, which can then fail the transfer or re-spawn that particular thread. Without FGFT, a thread can crash the system, or worse, corrupt state and hang, making the entry point unavailable until the system is rebooted.

3. "The FGFT scheme requires that driver-state and kernel-state touched by the driver is explicitly copied to enable recovery. While the driver-state touched is explicitly annotated by the user, it is unclear how the kernel-state touched is identified. While there is some description about ensuring that only fields of structures, and not the entire structure, touched by the driver are copied, it is unclear how this works in practice. Can the authors provide an example of the kinds of structures that were touched and which fields were copied, as opposed to the entire structure?"

An example of copying fields (and not whole structures): Consider an ioctl that updates the driver's internal private structure (usually pointed to by netdev->priv, where netdev is the kernel's net device structure). In such cases, FGFT uses points-to analysis to statically pre-determine the fields touched, such as netdev->priv->tx_ring and netdev->priv->rx_ring, and generates marshaling code that copies in/out only those fields. This reduces both the marshaling code and unnecessary copying.
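The generated code might look roughly like the sketch below. The type names follow the e1000 driver; struct fgft_shadow and the copy helpers are our assumed illustration, not the paper's actual generated output:

```c
#include <linux/netdevice.h>
#include "e1000.h"	/* e1000_adapter, e1000_tx_ring, e1000_rx_ring */

/* Private copies of only the fields the entry point touches. */
struct fgft_shadow {
	struct e1000_tx_ring tx_ring;
	struct e1000_rx_ring rx_ring;
};

/* Copy in just the two rings identified by points-to analysis,
 * rather than the entire (much larger) private adapter structure. */
static void fgft_copy_in(struct fgft_shadow *shadow,
			 struct net_device *netdev)
{
	struct e1000_adapter *adapter = netdev_priv(netdev);

	shadow->tx_ring = *adapter->tx_ring;
	shadow->rx_ring = *adapter->rx_ring;
}

/* On commit, merge the possibly modified copies back. */
static void fgft_copy_out(struct fgft_shadow *shadow,
			  struct net_device *netdev)
{
	struct e1000_adapter *adapter = netdev_priv(netdev);

	*adapter->tx_ring = shadow->tx_ring;
	*adapter->rx_ring = shadow->rx_ring;
}
```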
5. "How were the time-related measurements in section 5.3 done? What is the error margin of the measurement?"

Timing measurement detail: We used the TSC processor register (via rdtscll calls) to obtain timestamps, which gives extremely high precision over short intervals. We report the average of 5 runs.
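For instance, a measurement on a 2.6-era x86 kernel might look like the fragment below (fgft_device_checkpoint and pdev are the hypothetical names from the first sketch):

```c
#include <linux/kernel.h>
#include <asm/msr.h>	/* rdtscll() on 2.6-era x86 kernels */

unsigned long long start, end, total = 0;
int i;

for (i = 0; i < 5; i++) {
	rdtscll(start);
	fgft_device_checkpoint(pdev);	/* operation being timed */
	rdtscll(end);
	total += end - start;
}
/* Mean cycles over 5 runs; dividing by the TSC frequency
 * (cycles per second) converts to wall-clock time. */
printk(KERN_INFO "fgft: checkpoint avg %llu cycles\n", total / 5);
```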
Review 5 (Weak Reject):

- "Cost of protection: 20+ us. In other words, the approach adds 60,000 cycles to each driver entry point that needs to be protected." We agree. However, this cost is much lower than that of logging each and every device operation across all driver calls (not just the failing one) in order to restore the device configuration correctly.

- "Knowing what to protect (the method is probably too expensive to protect everything)." See the discussion of selective isolation above: static analysis and dynamic instrumentation tools can identify the suspect entry points worth protecting.

- "Which class of bugs does it actually help against (e.g., would a restarted driver after checkpoint resumption just fail again the same way, in the common case?). The method seems to only work for heisenbugs (flaky hardware and such), but then I wonder if I would really want to run my compute services on broken hardware." Can FGFT protect against more than flaky-hardware bugs? Yes: we can protect against flaky hardware, against entry points that fail on specific inputs (e.g., a buffer overflow), and against infrequently used code (a rarely invoked ioctl need not make the common case inoperable, or even slower). Instead of crashing the system, or making the device inoperable because a thread crashed while holding device/driver locks, FGFT ensures that the other code paths of the device remain operable (Section 5.1).

- "Is there an assumption about fully serialized drivers (checkpointing requires that no thread is executing anywhere in the driver at checkpoint time)? Put differently, do the performance data presented account for the possibility that the method forces more serialization than would otherwise be needed for the driver (e.g., multiqueue NICs, etc.)? What assumptions are made about drivers for stateful devices like disks, where a checkpoint can't include the data being written to disk?" For drivers with persistent internal state, such as disks and other storage devices, restore recovers only the transient device state, not the persistent state such as the contents of files. As a result, the use of checkpoints must be coordinated with higher-level recovery mechanisms, such as Membrane [39], to keep persistent data consistent.

[1] Microdrivers: A New Architecture for Device Drivers, HotOS 2007.

===

- We are not yet clear about multi-queue NICs; our guess is that they should not cause any more problems than standard network drivers.
- We are also not yet clear about the special case of acquiring a lock on a shared structure and then waiting for data to arrive on it: since all locks are released only at the end, this can deadlock.